Status Report: Docker Toolbox failures on Windows 10

TL;DR

A growing body of experience building containers on Windows 10 systems (using the Docker Toolbox for Windows) indicates that Docker Toolbox for Windows (on Win10 at minimum) is not a supportable combination for our project at present.  It appears that the command in docker-compose.yml cannot be found – at least in cases where the called script is stored in a directory inside the container.

While there may be some adjustment that could be made to support both Windows and *NIX hosts in this scenario, it’s yet another incompatibility we’ve encountered (among many) that only crops up when using Windows as the host, and our team has to focus its cycles on supporting the majority of our project’s developers (who aren’t on Windows).

Problem

One of my colleagues on the project reported the following error when bringing up the Docker container from this branch in this repo.  He’s using Windows 10, Docker Toolbox for Windows:

user@DESKTOP MINGW64 /c/develop/python/team-budget/budget_proj (dockerize)
$ docker-compose up
Starting budgetproj_web_1

ERROR: for web Cannot start service web: oci runtime error: container_linux.go:247: starting container process caused "exec: \"/code/docker-entrypoint.sh\": stat /code/docker-entrypoint.sh: no such file or directory"
ERROR: Encountered errors while bringing up the project.

I explained that as far as I understand Docker engine and Docker Toolbox (which hosts the engine in a Virtualbox VM), what’s going on is that inside the container, Docker is trying to execute /code/docker-entrypoint.sh – so theoretically there should be no reason why this would behave any differently on Windows than Mac or Linux, since the runtime environment inside the Docker container shouldn’t know anything about its underlying host’s environment.  I know for sure it’s working well on Mac and Linux, even on my personal Mac that’s running the Docker Toolbox.

Investigation

Budget repo on Windows: fails

I attempted this myself with the same branch/repo using Docker Toolbox for Windows (downloaded today, running v1.13.1 of Docker engine) on Windows 10 (Anniversary update), and received effectively the same result:

...

Step 12/12 : WORKDIR /code
 ---> 37cb2ce39964
Removing intermediate container af6440be2e49
Successfully built 37cb2ce39964
WARNING: Image for service web was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
Creating budgetproj_web_1
Attaching to budgetproj_web_1
web_1  | standard_init_linux.go:178: exec user process caused "no such file or directory"
budgetproj_web_1 exited with code 1

Budget repo on Mac: succeeds

The same commit from the same branch/repo on OS X 10.11 with Docker Toolbox for Mac:

...

Step 12/12 : WORKDIR /code
 ---> 0da697ffe35c
Removing intermediate container d4df9c99e8f9
Successfully built 0da697ffe35c
Recreating budgetproj_web_1
Attaching to budgetproj_web_1
web_1 | Operations to perform:
web_1 | Apply all migrations: admin, auth, budget_app, contenttypes, sessions
web_1 | Running migrations:
web_1 | Applying contenttypes.0001_initial... OK
web_1 | Applying auth.0001_initial... OK
web_1 | Applying admin.0001_initial... OK
web_1 | Applying admin.0002_logentry_remove_auto_add... OK
web_1 | Applying contenttypes.0002_remove_content_type_name... OK
web_1 | Applying auth.0002_alter_permission_name_max_length... OK
web_1 | Applying auth.0003_alter_user_email_max_length... OK
web_1 | Applying auth.0004_alter_user_username_opts... OK
web_1 | Applying auth.0005_alter_user_last_login_null... OK
web_1 | Applying auth.0006_require_contenttypes_0002... OK
web_1 | Applying auth.0007_alter_validators_add_error_messages... OK
web_1 | Applying auth.0008_alter_user_username_max_length... OK
web_1 | Applying budget_app.0001_initial... OK
web_1 | Applying budget_app.0002_auto_20170221_0359... OK
web_1 | Applying sessions.0001_initial... OK
web_1 | [2017-03-01 21:42:05 +0000] [10] [INFO] Starting gunicorn 19.6.0
web_1 | [2017-03-01 21:42:05 +0000] [10] [INFO] Listening at: http://0.0.0.0:8000 (10)
web_1 | [2017-03-01 21:42:05 +0000] [10] [INFO] Using worker: sync

So I tried this on a couple of other of our organization’s projects.

Housing repo on Windows: fails

Housing-17 under Docker Toolbox for Windows on Windows 10:

$ docker-compose up --build
Building web
Step 1/6 : FROM python:3.5
 ---> 4e5ed9f6613e
Step 2/6 : ENV PYTHONUNBUFFERED 1
 ---> Using cache
 ---> a62a6ae73cec
Step 3/6 : ADD ./requirements.txt /provision/
 ---> Using cache
 ---> 9f34a7d35294
Step 4/6 : WORKDIR /provision/
 ---> Using cache
 ---> 4e06b4c2249f
Step 5/6 : RUN pip install -r requirements.txt
 ---> Using cache
 ---> e96a581fc549
Step 6/6 : WORKDIR /code/
 ---> Using cache
 ---> fc61cc36c06f
Successfully built fc61cc36c06f
Starting housing17_db_1
Starting housing17_web_1
Attaching to housing17_db_1, housing17_web_1
db_1   | LOG:  database system was shut down at 2017-03-01 19:38:08 UTC
db_1   | LOG:  MultiXact member wraparound protections are now enabled
db_1   | LOG:  database system is ready to accept connections
web_1  | standard_init_linux.go:178: exec user process caused "no such file or directory"
housing17_web_1 exited with code 1

Housing repo on Mac: succeeds

Housing-17 under Docker Toolbox for Mac on OS X 10.11:

...

Step 6/6 : WORKDIR /code/
 ---> Using cache
 ---> f79cbc2964cb
Successfully built f79cbc2964cb
Starting housing17_db_1
Starting housing17_web_1
Attaching to housing17_db_1, housing17_web_1
db_1 | LOG: database system was shut down at 2017-03-01 03:26:27 UTC
db_1 | LOG: MultiXact member wraparound protections are now enabled
db_1 | LOG: database system is ready to accept connections
web_1 | 
web_1 | 0 static files copied to '/code/static', 126 unmodified.
web_1 | [2017-03-01 21:50:00 +0000] [7] [INFO] Starting gunicorn 19.6.0
web_1 | [2017-03-01 21:50:00 +0000] [7] [INFO] Listening at: http://0.0.0.0:8000 (7)
web_1 | [2017-03-01 21:50:00 +0000] [7] [INFO] Using worker: sync

Emergency_response repo on Windows: fails

Emergency-response-backend under Docker Toolbox for Windows on Windows 10:

... 

---> d1dece959fab
Removing intermediate container 446d7dae0532
Step 14/14 : COPY . /code/
 ---> d12ccbed9557
Removing intermediate container 27809cb3988a
Successfully built d12ccbed9557
WARNING: Image for service web was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
Creating emergencyresponsebackend_web_1
Attaching to emergencyresponsebackend_web_1
web_1  | standard_init_linux.go:178: exec user process caused "no such file or directory"
emergencyresponsebackend_web_1 exited with code 1

Emergency_response repo on Mac: succeeds/fails (but for an application-specific reason)

Emergency-response-backend under Docker Toolbox for Mac on OS X 10.11:

Step 14/14 : COPY . /code/
 ---> bc4a1bd8e372
Removing intermediate container e9b079ea31da
Successfully built bc4a1bd8e372
WARNING: Image for service web was built because it did not already exist. To rebuild this image you must use `docker-compose build` or `docker-compose up --build`.
Creating emergencyresponsebackend_web_1
Attaching to emergencyresponsebackend_web_1
web_1 | Traceback (most recent call last):
web_1 | File "manage.py", line 10, in <module>
web_1 | execute_from_command_line(sys.argv)
web_1 | File "/usr/local/lib/python3.4/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
web_1 | utility.execute()
web_1 | File "/usr/local/lib/python3.4/site-packages/django/core/management/__init__.py", line 316, in execute
web_1 | settings.INSTALLED_APPS
web_1 | File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 53, in __getattr__
web_1 | self._setup(name)
web_1 | File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 41, in _setup
web_1 | self._wrapped = Settings(settings_module)
web_1 | File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 97, in __init__
web_1 | mod = importlib.import_module(self.SETTINGS_MODULE)
web_1 | File "/usr/local/lib/python3.4/importlib/__init__.py", line 109, in import_module
web_1 | return _bootstrap._gcd_import(name[level:], package, level)
web_1 | File "<frozen importlib._bootstrap>", line 2254, in _gcd_import
web_1 | File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
web_1 | File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
web_1 | File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
web_1 | File "<frozen importlib._bootstrap>", line 1129, in _exec
web_1 | File "<frozen importlib._bootstrap>", line 1471, in exec_module
web_1 | File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
web_1 | File "/code/emergency_response_api/settings.py", line 16, in <module>
web_1 | from . import project_config
web_1 | ImportError: cannot import name 'project_config'
web_1 | Traceback (most recent call last):
web_1 | File "manage.py", line 10, in <module>
web_1 | execute_from_command_line(sys.argv)
web_1 | File "/usr/local/lib/python3.4/site-packages/django/core/management/__init__.py", line 367, in execute_from_command_line
web_1 | utility.execute()
web_1 | File "/usr/local/lib/python3.4/site-packages/django/core/management/__init__.py", line 316, in execute
web_1 | settings.INSTALLED_APPS
web_1 | File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 53, in __getattr__
web_1 | self._setup(name)
web_1 | File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 41, in _setup
web_1 | self._wrapped = Settings(settings_module)
web_1 | File "/usr/local/lib/python3.4/site-packages/django/conf/__init__.py", line 97, in __init__
web_1 | mod = importlib.import_module(self.SETTINGS_MODULE)
web_1 | File "/usr/local/lib/python3.4/importlib/__init__.py", line 109, in import_module
web_1 | return _bootstrap._gcd_import(name[level:], package, level)
web_1 | File "<frozen importlib._bootstrap>", line 2254, in _gcd_import
web_1 | File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
web_1 | File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
web_1 | File "<frozen importlib._bootstrap>", line 1200, in _load_unlocked
web_1 | File "<frozen importlib._bootstrap>", line 1129, in _exec
web_1 | File "<frozen importlib._bootstrap>", line 1471, in exec_module
web_1 | File "<frozen importlib._bootstrap>", line 321, in _call_with_frames_removed
web_1 | File "/code/emergency_response_api/settings.py", line 16, in <module>
web_1 | from . import project_config
web_1 | ImportError: cannot import name 'project_config'
...

(Note: the runtime error in “manage.py” indicates that the command in docker-compose.yml was executed correctly, and there’s an app-specific issue in running one of the commands inside the docker-entrypoint.sh script that docker-compose.yml specifies.)

Potentially related issues re: Docker running on Windows

Absolute paths change with git bash on Windows

Docker Build Image fails with ‘NOT FOUND’ executing…

Bash script always prints command not found (perhaps a missing ‘execute’ bit, that only affects commands running in a Docker container when running on Docker Toolbox for Windows?)

 

Advertisements

What I’ve learned: setting up Bash/Ubuntu/Win10 for Ansible + Vagrant + VirtualBox

My Goal: test the use of this Ansible Role from Windows 10, using a combination of Windows and Bash for Ubuntu on Windows 10 tools.  Favour the *nix tools wherever possible, for maximum compatibility with the all-Linux production environment.

Preconditions

Here is the software/shell arrangement that worked for me in my Win10 box:

  • Runs in Windows: Virtualbox, Vagrant
  • Runs in Bash/Ubuntu: Ansible (in part because of this)

In this setup, I’m using a single Virtualbox VM in default network configuration, whereby Vagrant ends up reporting the host listening on 127.0.0.1 and SSH listening on TCP port 2222.  Substitute your actual values as required.

Also note the versions of software I’m currently running:

  • Windows 10: Anniversary Update, build 14393.51
  • Ansible (*nix version in Bash/Ubuntu/Win10): 1.5.4
  • VirtualBox (Windows): 5.0.26
  • Vagrant (Windows): 1.8.1

Run the Windows tools from a Windows shell

  • C:\> vagrant up
  • (or launch a Bash shell with cbwin support:  C:\>outbash, then try running /mnt/c/…/Vagrant.exe up from the bash environment)

Start the Virtualbox VMs using Vagrant

  • Vagrant (Bash) can’t just do vagrant up where VirtualBox is installed in Windows – it depends on being able to call the VBoxManage binary
    • Q: can I trick Bash to call VBoxManage.exe from /mnt/c/Program Files/Oracle/VirtualBox?
    • If not, is it worth messing around with Vagrant (Bash)?  Or should I relent and try Vagrant (Windows), either using cbwin or just running from a different shell?
  • Vagrant (Windows) runs into the fscking rsync problem (as always)
    • Fortunately you can disable rsync if you don’t need the sync’d folders
    • Disabling the synced_folder requires editing the Vagrantfile to add this in the Vagrant.configure section:
      config.vm.synced_folder “.”, “/vagrant”, disabled: true

Setup the inventory for management

  • Find the IP’s for all managed boxes
  • Organize them (in one group or several) in the /etc/ansible/hosts file
  • Remember to specify the SSH port if non-22:
    [test-web]
    127.0.0.1 ansible_ssh_port=2222
    # 127.0.0.1 ansible_port=2222 when Ansible version > 1.9
    • While “ansible_port” is said to be the supported parameter as of Ansible 2.0, my own experience with Ansible under Bash on Windows was that ansible wouldn’t connect properly to the server until I changed the inventory configuration to use “ansible_ssh_port”, even though ansible –version reported itself as 2.1.1.0
    • Side question: is there some way to predictably force the same SSH port every time for the same box?  That way I can setup an inventory in my Bash environment and keep it stable.

Getting SSH keys on the VMs

  • (Optional: generate keys if not already) Run ssh-keygen -t rsa
  • (Optional: if you’ve destroyed and re-generated the VM with vagrant destroy/up, wipe out the existing key for the host:port combination by running the following command that is recommended when ssh-copy-id fails): ssh-keygen -f “/home/mike/.ssh/known_hosts” -R [127.0.0.1]:2222
  • Run ssh-copy-id vagrant@127.0.0.1 -p 2222 to push the public key to the target VM’s vagrant account

Connect to the VMs using Ansible to test connectivity

  • [from Windows] vagrant ssh-config will tell you the IP address and port of your current VM
  • [from Bash] ansible all -u vagrant -m ping will check basic Ansible connectivity
    • (ansible all -c local -m ping will go even more basic, testing Ansible itself)

Run the playbook

  • Run ansible-playbook [playbook_name.yml e.g. playbook.yml] -u vagrant
    • If you receive an error like “SSH encountered an unknown error” with details that include “No more authentication methods to try.  Permission denied (publickey,password).”, make sure to remember to specify the correct remote user (i.e. one that trusts your SSH key)
    • If you receive an error like “stderr: E: Could not open lock file /var/lib/dpkg/lock – open (13: Permission denied)”, make sure your remote user runs with root privilege – e.g. in the [playbook.yml], ensure sudo: true is included
  • Issue: if you receive an error like “fatal: [127.0.0.1]: UNREACHABLE! => {“changed”: false, “msg”: “Failed to connect to the host via ssh.”, “unreachable”: true}”, check that your SSH keys are trusted by the remote user you’re using (e.g. “-u vagrant” may not have the SSH keys already trusted)
  • If you wish to target a subset of servers in your inventory (e.g. using one or more groups), add the “-l” parameter and name the inventory group, IP address or hostname you wish to target
    e.g. ansible-playbook playbook.yml -u vagrant -l test-web
    or ansible-playbook playbook.yml -u vagrant -l 127.0.0.1

Protip: remote_user

If you want to stop having to add -u vagrant to all the fun ansible commands, then go to your /etc/ansible/ansible.cfg file and add remote_user = vagrant in the appropriate location.

Rabbit Hole Details for the Pedantically-Inclined

03ec8fe3bb146924423af6381eb99ea9

Great Related Lesson: know the difference between vagrant commands

  • Run vagrant ssh to connect to the VM [note: requires an SSH app installed in Windows, under this setup]
  • Run vagrant status to check what state the VM is in
  • Run vagrant reload to restart the VM
  • Run vagrant halt to stop the VM
  • Run vagrant destroy to wipe the VM

Ansible’s RSA issue when SSH’ing into a non-configured remote user

  • The following issue occurs when running ansible commands to a remote SSH target
    e.g. ansible all -m ping
  • This occurs even when the following commands succeed:
    • ansible -c local all -m ping
    • ssh vagrant@host.name [port #]
    • ssh-copy-id -p [port #] vagrant@host.name
  • Also note: prefixing with “sudo” doesn’t seem to help – just switches whose local keys you’re using
  • I spent the better part of a few hours (spaced over two days, due to rage quit) troubleshooting this situation
  • Troubleshooting this is challenging to say the least, as ansible doesn’t intelligently hint at the source of the problem, even though this must be a well-known issue
    • There’s nothing in the debug output of ssh/(openssl?) that indicates that there are no trusted SSH keys in the account of the currently-used remote user
    • Nor is it clear which remote user is being impersonated – sure, I’ll bet someone that fights with SSH & OpenSSL all day would have noticed the subtle hints, but for those of us just trying to get a job done, it’s like looking through foggy glass
  • Solution: remember to configure the remote user under which you’re connecting (i.e. a user with the correct permissions *and* who trusts the SSH keys in use)
    • Solution A: add the -u vagrant parameter
    • Solution B: specify remote_user = vagrant in the ansible.cfg file under [defaults]
mike@MIKE-WIN10-SSD:~/code/ansible-role-unattended-upgrades$ ansible-playbook role.yml -vvvv

PLAY [all] ********************************************************************

GATHERING FACTS ***************************************************************
<127.0.0.1> ESTABLISH CONNECTION FOR USER: mike
<127.0.0.1> REMOTE_MODULE setup
<127.0.0.1> EXEC ['ssh', '-C', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/home/mike/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'Port=2222', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10', '127.0.0.1', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1471378875.79-237810336673832 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1471378875.79-237810336673832 && echo $HOME/.ansible/tmp/ansible-tmp-1471378875.79-237810336673832'"]
fatal: [127.0.0.1] => SSH encountered an unknown error. The output was:
OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: auto-mux: Trying existing master
debug1: Control socket "/home/mike/.ansible/cp/ansible-ssh-127.0.0.1-2222-mike" does not exist
debug2: ssh_connect: needpriv 0
debug1: Connecting to 127.0.0.1 [127.0.0.1] port 2222.
debug2: fd 3 setting O_NONBLOCK
debug1: fd 3 clearing O_NONBLOCK
debug1: Connection established.
debug3: timeout: 10000 ms remain after connect
debug3: Incorrect RSA1 identifier
debug3: Could not load "/home/mike/.ssh/id_rsa" as a RSA1 public key
debug1: identity file /home/mike/.ssh/id_rsa type 1
debug1: identity file /home/mike/.ssh/id_rsa-cert type -1
debug1: identity file /home/mike/.ssh/id_dsa type -1
debug1: identity file /home/mike/.ssh/id_dsa-cert type -1
debug1: identity file /home/mike/.ssh/id_ecdsa type -1
debug1: identity file /home/mike/.ssh/id_ecdsa-cert type -1
debug1: identity file /home/mike/.ssh/id_ed25519 type -1
debug1: identity file /home/mike/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6
debug1: Remote protocol version 2.0, remote software version OpenSSH_6.7p1 Debian-5+deb8u1
debug1: match: OpenSSH_6.7p1 Debian-5+deb8u1 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug3: put_host_port: [127.0.0.1]:2222
debug3: load_hostkeys: loading entries for host "[127.0.0.1]:2222" from file "/home/mike/.ssh/known_hosts"
debug3: load_hostkeys: found key type ECDSA in file /home/mike/.ssh/known_hosts:2
debug3: load_hostkeys: loaded 1 keys
debug3: order_hostkeyalgs: prefer hostkeyalgs: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug2: kex_parse_kexinit: curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ssh-dss-cert-v01@openssh.com,ssh-rsa-cert-v00@openssh.com,ssh-dss-cert-v00@openssh.com,ssh-ed25519,ssh-rsa,ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-gcm@openssh.com,aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-gcm@openssh.com,aes256-gcm@openssh.com,chacha20-poly1305@openssh.com,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: hmac-md5-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-ripemd160-etm@openssh.com,hmac-sha1-96-etm@openssh.com,hmac-md5-96-etm@openssh.com,hmac-md5,hmac-sha1,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-ripemd160-etm@openssh.com,hmac-sha1-96-etm@openssh.com,hmac-md5-96-etm@openssh.com,hmac-md5,hmac-sha1,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: zlib@openssh.com,zlib,none
debug2: kex_parse_kexinit: zlib@openssh.com,zlib,none
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: kex_parse_kexinit: curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group14-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,chacha20-poly1305@openssh.com
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com,chacha20-poly1305@openssh.com
debug2: kex_parse_kexinit: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: kex_parse_kexinit: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: kex_parse_kexinit: none,zlib@openssh.com
debug2: kex_parse_kexinit: none,zlib@openssh.com
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: mac_setup: setup hmac-sha1-etm@openssh.com
debug1: kex: server->client aes128-ctr hmac-sha1-etm@openssh.com zlib@openssh.com
debug2: mac_setup: setup hmac-sha1-etm@openssh.com
debug1: kex: client->server aes128-ctr hmac-sha1-etm@openssh.com zlib@openssh.com
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ECDSA 07:f3:2f:b0:86:b5:b6:2b:d9:f5:26:71:95:6e:d9:ce
debug3: put_host_port: [127.0.0.1]:2222
debug3: put_host_port: [127.0.0.1]:2222
debug3: load_hostkeys: loading entries for host "[127.0.0.1]:2222" from file "/home/mike/.ssh/known_hosts"
debug3: load_hostkeys: found key type ECDSA in file /home/mike/.ssh/known_hosts:2
debug3: load_hostkeys: loaded 1 keys
debug1: Host '[127.0.0.1]:2222' is known and matches the ECDSA host key.
debug1: Found key in /home/mike/.ssh/known_hosts:2
debug1: ssh_ecdsa_verify: signature correct
debug2: kex_derive_keys
debug2: set_newkeys: mode 1
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug2: set_newkeys: mode 0
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug2: key: /home/mike/.ssh/id_rsa (0x7fffbdbd5b80),
debug2: key: /home/mike/.ssh/id_dsa ((nil)),
debug2: key: /home/mike/.ssh/id_ecdsa ((nil)),
debug2: key: /home/mike/.ssh/id_ed25519 ((nil)),
debug1: Authentications that can continue: publickey,password
debug3: start over, passed a different list publickey,password
debug3: preferred gssapi-with-mic,gssapi-keyex,hostbased,publickey
debug3: authmethod_lookup publickey
debug3: remaining preferred: ,gssapi-keyex,hostbased,publickey
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/mike/.ssh/id_rsa
debug3: send_pubkey_test
debug2: we sent a publickey packet, wait for reply
debug1: Authentications that can continue: publickey,password
debug1: Trying private key: /home/mike/.ssh/id_dsa
debug3: no such identity: /home/mike/.ssh/id_dsa: No such file or directory
debug1: Trying private key: /home/mike/.ssh/id_ecdsa
debug3: no such identity: /home/mike/.ssh/id_ecdsa: No such file or directory
debug1: Trying private key: /home/mike/.ssh/id_ed25519
debug3: no such identity: /home/mike/.ssh/id_ed25519: No such file or directory
debug2: we did not send a packet, disable method
debug1: No more authentication methods to try.
Permission denied (publickey,password).

TASK: [ansible-role-unattended-upgrades | add distribution-specific variables] ***
FATAL: no hosts matched or all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/home/mike/role.retry

127.0.0.1                  : ok=0    changed=0    unreachable=1    failed=0

Ansible’s permissions issue when trying to run non-trivial commands without sudo

  • ansible -m ping will work fine without local root permissions, making you think that you might be able to do other ansible operations without sudo
  • Haha! You would be wrong, foolish apprentice
  • Thus, the SSH keys for enabling ansible to work will have to be (a) generated for the local root user and (b) copied to the remote vagrant user
mike@MIKE-WIN10-SSD:~/code/ansible-role-unattended-upgrades$ ansible-playbook -u vagrant role.yml

PLAY [all] ********************************************************************

GATHERING FACTS ***************************************************************
ok: [127.0.0.1]

TASK: [ansible-role-unattended-upgrades | add distribution-specific variables] ***
ok: [127.0.0.1]

TASK: [ansible-role-unattended-upgrades | install unattended-upgrades] ********
failed: [127.0.0.1] => {"failed": true, "item": ""}
stderr: E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?

msg: 'apt-get install 'unattended-upgrades' ' failed: E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied)
E: Unable to lock the administration directory (/var/lib/dpkg/), are you root?


FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/home/mike/role.retry

127.0.0.1                  : ok=2    changed=0    unreachable=0    failed=1

 

Articles I reviewed while doing the work outlined here

https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys–2

http://blog.publysher.nl/2013/07/infra-as-repo-using-vagrant-and-salt.html

https://github.com/devopsgroup-io/vagrant-digitalocean

https://github.com/mitchellh/vagrant/issues/4073

http://stackoverflow.com/questions/23337312/how-do-i-use-rsync-shared-folders-in-vagrant-on-windows

https://github.com/mitchellh/vagrant/issues/3230

https://www.vagrantup.com/docs/synced-folders/basic_usage.html

http://docs.ansible.com/ansible/intro_inventory.html

http://stackoverflow.com/questions/36932952/ansible-unable-to-connect-to-aws-ec2-instance

http://serverfault.com/questions/649659/ansible-try-to-ping-connection-between-localhost-and-remote-server

http://stackoverflow.com/questions/22232509/vagrant-provision-works-but-i-cannot-send-an-ad-hoc-command-with-ansible

http://stackoverflow.com/questions/21670747/what-user-will-ansible-run-my-commands-as#21680256

Windows pre-WSL: fighting with vagrant, virtualbox and Cygwin

These are notes I kept for myself while struggling to find a way to use Vagrant and Virtualbox (among other tools like git, ansible and ssh) to test and debug virtual server builds without having to dual-boot into a Linux box.
Note to DavidP: you’ll see throughout the rsync errors that made this whole thing not *quite* a functional vagrant environment.
Obligatory: “this was of course a foolish mission that many loudly warned me against”.
Obligatory: “I thought I was smarter than most, and could probably decipher the grotty issues” arrogance.
Futile: “I hope these notes help someone else somewhere down the line with a similar issue”.
Important: my work was liberally stealing great ideas from Stefan van Essen:

Basic Setup

Details of what I did to make sure I can recreate my environment later:
  • Installed Cygwin64 with tools like bash, curl, gcc-core, make, nano, openssh, openssl, python, wget (and keep uninstalling python-cryptography because it interferes with my almost-working “ansible-under-cygwin” setup)
  • Installed apt-cyg to make cygwin package management easier
  • Installed VirtualBox for Windows with default paths (leaving the VMs in %username%/VirtualBox VMs)
  • Installed Vagrant for Windows with no changes
  • Created a symlink in the cygwin home directory to the VMs for ease of access (ln -s “/cygdrive/c/Users/Mike/VirtualBox VMs” ~/VMs)

First sign of trouble, selection of build target

  • Problem: “vagrant up” runs into a couple of problems consistently, which might have something to do with me instantiating the vagrant box under Windows (where there wasn’t an available rsync program to enable Vagrant to do whatever it needs when first “up”ing a new Vagrant box)
    • There seemed to be an SSH issue: 
      ==> default: Waiting for machine to boot. This may take a few minutes...
         default: SSH address: 127.0.0.1:2222
         default: SSH username: vagrant
         default: SSH auth method: private key
         default: Warning: Remote connection disconnect. Retrying...
         default: Warning: Remote connection disconnect. Retrying...
      ==> default: Machine booted and ready!
    • Then appeared a “hangover” rsync issue:
      ==> default: Rsyncing folder: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ => /vagrant
      There was an error when attempting to rsync a synced folder.
      Please inspect the error message below for more info.
      Host path: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/
      Guest path: /vagrant
      Command: rsync --verbose --archive --delete -z --copy-links --chmod=ugo=rwX --no-perms --no-owner --no-group --rsync-path sudo rsync -e ssh -p 2222 -o ControlMaster=auto -o ControlPath=C:/cygwin64/tmp/ssh.570 -o ControlPersist=10m -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i 'C:/Users/Mike/VirtualBox VMs/BaseDebian/.vagrant/machines/default/virtualbox/private_key' --exclude .vagrant/ /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ vagrant@127.0.0.1:/vagrant
      Error: Warning: Permanently added '[127.0.0.1]:2222' (ECDSA) to the list of known hosts.
      mm_receive_fd: no message header
      process_mux_new_session: failed to receive fd 0 from slave
      mux_client_request_session: read from master failed: Connection reset by peer
      Failed to connect to new control master
      rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.2]
    • I figured rather than treat this box as a pet, I’d just destroy it and try over again [good practice for breaking my old habits of trying to root cause catastrophic, but sometimes irreproducible, issues]
    • But no – these issues reappear for a freshly-init’d instance of the same vagrant box (debian/jessie64)
    • However, the next vagrant box I tried (hashicorp/precise64) (where I’ll put my dev tools) seemed close enough to fine, no red errors:
      ==> default: Machine booted and ready!
      ==> default: Checking for guest additions in VM...
         default: The guest additions on this VM do not match the installed version of
         default: VirtualBox! In most cases this is fine, but in rare cases it can
         default: prevent things such as shared folders from working properly. If you see
         default: shared folder errors, please make sure the guest additions within the
         default: virtual machine match the version of VirtualBox you have installed on
         default: your host and reload your VM.
         default:
         default: Guest Additions Version: 4.2.0
         default: VirtualBox Version: 5.0
      ==> default: Mounting shared folders...
         default: /vagrant => C:/Users/Mike/VirtualBox VMs/UbuntuDev
    • I also tried with (box-cutter/ubuntu1404-desktop) – worked great!
      ==> default: Machine booted and ready!
      ==> default: Checking for guest additions in VM...
      ==> default: Mounting shared folders...
         default: /vagrant => C:/Users/Mike/VirtualBox VMs/Ubuntu-box-cutter
    • Conclusion: using the box-cutter/ubuntu1404 box for my dev work

Configuration woes

  • Installed git (with suggested additions)
  • Installed ansible
  • Generated ssh keys, ssh-copy-id copied to mike@192.168.1.14 (debian1sttry), then ran “cp ~/.ssh/authorized_keys /root/.ssh” as root
  • Next I wanted to share a single Code folder between Windows and the Ubuntu VM:
    • 1st idea: create a symlink in the shared “vagrant” folder (C:\Users\Mike\VirtualBox VMs\Ubuntu-box-cutter) that would be addressable from inside the Ubuntu VM:
      [from Cygwin on host] ln -s /cygdrive/c/Users/Mike/Code Code
      [from Ubuntu VM] 
      vagrant@vagrant:~$ ls /vagrant/Code
      /vagrant/Code
      vagrant@vagrant:~$ cd /vagrant/Code
      bash: cd: /vagrant/Code: Not a directory
    • 2nd idea: try a native-Windows symlink, not a Cygwin version
      [from CMD on host] 
      C:\Users\Mike\VirtualBox VMs\Ubuntu-box-cutter>mklink /D Code C:\Users\Mike\Code
      symbolic link created for Code <<===>> C:\Users\Mike\Code
      [from Ubuntu VM]
      vagrant@vagrant:/vagrant$ cd /vagrant/Code
      bash: cd: /vagrant/Code: Not a directory
    • 3rd idea: create a new synced folder via vagrant
  • Meanwhile, I started out a new repo (https://github.com/MikeTheCanuck/jQuery-infra-update) to contain the suggested Ansible code (using the Git Bash shell from Git for Windows):
    • mkdir code
      cd code
      touch 50unattended-upgrades
      touch 02periodic
      git add 50unattended-upgrades
      git add 02periodic
    • [then tried git commit but got prompted for identity]
    • git config --global user.email "mikethecanuck@gmail.com"
      git config --global user.name "Mike Lonergan"
      git commit -m "First commit"
      git push origin master
    • [fun: Windows Credential manager prompted for my Github creds]
  • Now to follow the canonical git-flow http://nvie.com/posts/a-successful-git-branching-model/:
    • git checkout -b develop
      git checkout -b default-file-contents develop
      git commit -m "populated default content" code/02periodic code/50unattended-upgrades
      git push
    • [I didn’t think it worked – I didn’t see any changes on the upstream repo – so I tried over]
    • git reset
      git reset --hard
      git commit -m "Added default config file contents" -a
      git push
    • [which resulted in the warning “fatal: The current branch default-file-contents has no upstream branch.  To push the current branch and set the remote as upstream, use…(command)”, so I did]
    • git push --set-upstream origin default-file-contents
  • Result: I have two commits to the branch on my Git repo

Ansible testing environment setup

  • 1st: setup my inventory in /etc/ansible/hosts
  • 2nd: test connectivity “ansible dev -m ping -u root” [cause my target doesn’t have a vagrant account – yeah yeah, I’ll get to building a target using vagrant soon]
  • 3rd: move the files over for testing:
     ansible dev -u root -m copy -a "backup=yes src=~/code/jQuery-infra-update/code/50unattended-upgrades dest=/etc/apt/apt.conf.d/"
     ansible dev -u root -m copy -a "backup=yes src=~/code/jQuery-infra-update/code/02periodic dest=/etc/apt/apt.conf.d/"

Testing vagrant on Debian clones

  • I installed a base Debian box and ignored this error during “vagrant up” that I hoped wouldn’t be a problem:
    ==> default: Checking for guest additions in VM...
       default: No guest additions were detected on the base box for this VM! Guest
       default: additions are required for forwarded ports, shared folders, host only
       default: networking, and more. If SSH fails on this machine, please install
       default: the guest additions and repackage the box to continue.
       default:
       default: This is not an error message; everything may continue to work properly,
       default: in which case you may ignore this message.
    ==> default: Rsyncing folder: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ => /vagrant
    There was an error when attempting to rsync a synced folder
    Please inspect the error message below for more info.
    Host path: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/
    Guest path: /vagrant
    Command: rsync --verbose --archive --delete -z --copy-links --chmod=ugo=rwX --no-perms --no-owner --no-group --rsync-path sudo rsync -e ssh -p 2222 -o ControlMaster=auto -o ControlPath=C:/cygwin64/tmp/ssh.275 -o ControlPersist=10m -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i 'C:/Users/Mike/VirtualBox VMs/BaseDebian/.vagrant/machines/default/virtualbox/private_key' --exclude .vagrant/ /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ vagrant@127.0.0.1:/vagrant
    Error: Warning: Permanently added '[127.0.0.1]:2222' (ECDSA) to the list of known hosts.
    mm_receive_fd: no message header
    process_mux_new_session: failed to receive fd 0 from slave
    mm_send_fd: sendmsg(2): Broken pipe
    mux_client_request_session: send fds failed
    rsync: connection unexpectedly closed (0 bytes received so far) [sender]
    rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.2]
  • Now I’m working from the BaseDebian box I’d setup, and tried to clone a couple of boxes from it, but found that they complained that the VBox Additions weren’t installed and wouldn’t allow me to proceed:
    ==> default: Checking for guest additions in VM...
        default: No guest additions were detected on the base box for this VM! Guest
        default: additions are required for forwarded ports, shared folders, host only
        default: networking, and more. If SSH fails on this machine, please install
        default: the guest additions and repackage the box to continue.
        default:
        default: This is not an error message; everything may continue to work properly,
        default: in which case you may ignore this message.
    ==> default: Mounting shared folders...
        default: /vagrant => C:/Users/Mike/VirtualBox VMs/DebianTest1
    Failed to mount folders in Linux guest. This is usually because
    the "vboxsf" file system is not available. Please verify that
    the guest additions are properly installed in the guest and
    can work properly. The command attempted was:
    
    mount -t vboxsf -o uid=`id -u vagrant`,gid=`getent group vagrant | cut -d: -f3` vagrant /vagrant
    mount -t vboxsf -o uid=`id -u vagrant`,gid=`id -g vagrant` vagrant /vagrant
    
    The error output from the last command was:
    
    stdin: is not a tty
    mount: unknown filesystem type 'vboxsf'
  • OK, so I followed this article and have pulled together these commands to try to get the Additions installed on the base box (here, I destroyed the clones and am working to repair the box from which I cloned)
     sudo apt-get install gcc
     sudo apt-get install make

    (reboot the box)

     sudo mount /dev/cdrom /media/cdrom
     cd /media/cdrom
     sudo ./VBoxLinuxAdditions.run
  • Sadly, this procedure did *not* repair the rsync issue that I was seeing originally before packaging from that box, but I tried vagrant package and clone from this box again, in case what I did happens to have solve the “unknown filesystem type ‘vboxsf'” error
  • Nope, same problem on the cloned box – “unknown filesystem type ‘vboxsf'”
    • Just in case, I tried installing the additions in the cloned box
    • They won’t install – complains about the following:
      Building the VirtualBox Guest Additions kernel modules
      The headers for the current running kernel were not found. If the following
      module compilation fails then this could be the reason.
      
      Building the main Guest Additions module ...fail!
      (Look at /var/log/vboxadd-install.log to find out what went wrong)
      Doing non-kernel setup of the Guest Additions ...done.
    • And that log file says:
      /tmp/vbox.0/Makefile.include.header:97: *** Error: unable to find the sources of your current Linux kernel. Specify KERN_DIR=<directory> and run Make again.  Stop.
      Creating user for the Guest Additions.
      Creating udev rule for the Guest Additions kernel module.
    • So I tried this article next, but that didn’t entirely help – the headers still aren’t available
    • However, instead of erroring out, the VBox Guest Additions installation said this at the same place:
      Building the VirtualBox Guest Additions kernel modules
      The headers for the current running kernel were not found. If the following
      module compilation fails then this could be the reason.
      
      Building the main Guest Additions module ...done.
      Building the shared folder support module ...done.
      Building the graphics driver module ...done.
      Doing non-kernel setup of the Guest Additions ...done.
      Starting the VirtualBox Guest AdditionsInstalling the Window System drivers
      Could not find the X.Org or XFree86 Window System, skipping.
      ...done.
    • HOLY SHIT GUYS, it actually worked:
       default: Mounting shared folders...
         default: /vagrant => C:/Users/Mike/VirtualBox VMs/DebianTest1
    • So I rebuilt the base Debian box with this procedure – sadly it didn’t address the rsync issue (that was a valiant wish), but at least I should be able to clone this box and avoid the ‘vboxsf’ issue
    • To address the rsync issue, I tried a variant on this article’s top answer – placed the cwrsync folder on my C:\ root, then copied the script to an rsync.bat file

Issues I never solved

  • Q: what’s the source of the actual problem with the rsync on my BaseDebian box?
    • TODO: Follow this article for some ideas on troubleshooting/narrowing down the problem
  • Q: what’s the source of the problem that causes my “vagrant up” to fail twice on the SSH auth?

Other references

Bash/Ubuntu on Win10: getting *nix vagrant working with virtualbox (not)

TL;DR Getting vagrant + virtualbox running natively in Bash for Unbuntu on Windows is a no-go.  Try a hybrid Windows/WSL solution instead.

At the end of our last episode, our hero was trapped under the following paradox:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ vagrant up
VirtualBox is complaining that the installation is incomplete. Please
run `VBoxManage --version` to see the error message which should contain
instructions on how to fix this error.
mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ VBoxManage --version
WARNING: The character device /dev/vboxdrv does not exist.
         Please install the virtualbox-dkms package and the appropriate
         headers, most likely linux-headers-3.4.0+.

         You will not be able to start VMs until this problem is fixed.

However, the advice for installing virtualbox-dkms is merely a distraction:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ sudo apt-get install virtualbox-dkms
Reading package lists... Done
Building dependency tree
Reading state information... Done
virtualbox-dkms is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 44 not upgraded.

And installing linux-headers-3.4.0+ doesn’t seem to work:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ sudo apt-get install linux-headers-3.4.0+
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package linux-headers-3.4.0
E: Couldn't find any package by regex 'linux-headers-3.4.0'

Where to go from here?

AskUbuntu turns up this tasty lead:

http://askubuntu.com/questions/465454/problem-with-the-installation-of-virtualbox

…where VBoxManage instead indicates “…most likely linux-headers-generic”.  This aligns with my previous investigation into the version of Linux that ships with Bash on Ubuntu for Windows (‘uname -r’ returns “3.4.0+”, which I suspect is what VBoxManage appends to its “most likely” hint).

Aside

On a lark, I decided to see if I could confirm this theory from the virtualbox source code.  Since it’s Oracle, of course they had to use an “enterprise-y” repo (Trac) which provides a browseable but not searchable front-end, so I pawed through each of the .cpp files by hand on the off-chance this message was being constructed directly in VBoxManage*.cpp source:

https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Frontends/VBoxManage

It’s entirely possible the message is passed up from an imported library, or that it’s constructed from fragments that don’t explicitly include the string “most likely” in any one line of source, but I wasn’t able to find it from this branch of the virtualbox source repo.

Dead End, Take a Guess

OK, if there’s no specific indication which version of the headers must be used, and on the assumption no damage can be caused by downloading what should merely be text files, then let’s just try the linux-headers-generic and see what happens.

And the apt-get messages seem promising – especially that it selected linux-headers-3.13.* files magically without me tracking down which specific versions I needed:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ sudo apt-get install linux-headers-generic
[sudo] password for mike:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  linux-headers-3.13.0-92 linux-headers-3.13.0-92-generic
The following NEW packages will be installed:
  linux-headers-3.13.0-92 linux-headers-3.13.0-92-generic
  linux-headers-generic
0 upgraded, 3 newly installed, 0 to remove and 44 not upgraded.
Need to get 9,571 kB of archives.
After this operation, 77.0 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Cool, except for these lines in the script output:

Examining /etc/kernel/header_postinst.d.
run-parts: executing /etc/kernel/header_postinst.d/dkms 3.13.0-92-generic /boot/vmlinuz-3.13.0-92-generic

(why does that make me think I just overwrote something important?)

…and for reasons initially unknown to me, the run-parts script seems to go zombie.

[Aside: I’m still too polluted by my Win32 experience, so I kept trying to interrupt with Ctrl-C.  No.  Bad dog, no treat.  Instead, try Ctrl-Z (pronounced “zed”, ’cause I’m Canadian like that.]

Finding Out if Anyone Else Has Seen This

Vagrant is a pretty popular way of managing virtual machines these days, right?  Yeah.  And while I might be in the first days of the public release of Bash on Windows, there’s been an Insiders Preview going for months, and lots of people banging on the corners.

So what are the odds someone else has tried this too?

Old school: search stackoverflow.com, social.technet.microsoft.com.  No bueno – plenty of folks reporting issues on SO with Bash on Windows, but no one there has reported this vagrant problem.

New school: somehow stumbled across the github repo for BashOnWindows, and dutifully filled out as detailed an issue report as I could muster.

=== NOW HERE’S THE PART THAT BLEW MY MIND ===

A Microsoft employee responded with an intelligent and helpful reply within hours on the same day!!!

(I remember a decade ago, Microsoft’s ‘engagement’ with customers reporting real issues with new software – even when Microsoft’s external bug trackers existed – was abysmal.  You’d be lucky to get an acknowledgement inside a month, and rarely if ever would they bother to update the issue when/if it ever got dispositioned, let alone addressed.  THIS KIND OF RESPONSIVENESS IS AMAZING FROM A CORPORATION.)

Root Issue

My bad, I’d misunderstood the implications of this: WSL (Windows Subsystem for Linux), which supports the user-mode Bash on Ubuntu layer, doesn’t implement any native Linux kernel support.  It’s all user-mode support, and it’s only for non-GUI apps (i.e. things that don’t require Display:0).

Our intrepid Microsoft employee reports here that DKMS isn’t currently supported.  The fact I took it even further to try installing the linux headers was moot; /dev/vboxdrv wouldn’t be available no matter what.

Cleanup in Aisle 4

Did you happen to go down the same road as me?  [What, are you similarly touched in the head?]  If so, here’s what I did to back out of my mess:

  • Performed the lock/install package cleanup specified here
  • Did as clean an uninstall of the linux-headers-generic package as I could (running sudo apt-get –purge remove linux-headers-generic), which outputs
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following packages will be REMOVED:
      linux-headers-generic*
    0 upgraded, 0 newly installed, 1 to remove and 47 not upgraded.
    2 not fully installed or removed.
    After this operation, 29.7 kB disk space will be freed.
    Do you want to continue? [Y/n]

    …which leads to the same run-parts script that fails.  Cleanup the locks/install packages again…then pray not enough damage was done by run-parts (in either direction) to matter. [Boy is that a landmine waiting to go off months from now…]

  • Clean uninstall of vagrant (sudo apt-get –purge remove vagrant)…which somehow leads again to these same lines:
    ...
    Do you want to continue? [Y/n]
    (Reading database ... 64370 files and directories currently installed.)
    Removing vagrant (1.4.3-1) ...
    Purging configuration files for vagrant (1.4.3-1) ...
    Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
    Setting up linux-headers-3.13.0-92-generic (3.13.0-92.139) ...
    Examining /etc/kernel/header_postinst.d.
    run-parts: executing /etc/kernel/header_postinst.d/dkms 3.13.0-92-generic /boot/vmlinuz-3.13.0-92-generic

    Ctrl-Z, rm locks and install bits.  [This is getting old.]

  • Clean uninstall of virtualbox (sudo apt-get –purge remove virtualbox)…and once again that unkillable linux-headers setup rears its head.
  • Let’s look closer.
  • Here’s the preamble when removing the vagrant package:
    mike@MIKE-WIN10-SSD:/var/lib/dpkg/updates$ sudo apt-get --purge remove vagrant
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following packages were automatically installed and are no longer required:
      bsdtar libarchive13 liblzo2-2 libnettle4 libruby1.9.1 ruby ruby-childprocess
      ruby-erubis ruby-ffi ruby-i18n ruby-log4r ruby-net-scp ruby-net-ssh
      ruby1.9.1
    Use 'apt-get autoremove' to remove them.
    The following packages will be REMOVED:
      vagrant*
    0 upgraded, 0 newly installed, 1 to remove and 47 not upgraded.
    1 not fully installed or removed.
    After this operation, 1,612 kB disk space will be freed.
    Do you want to continue? [Y/n]

    [My italics for emphasis]

  • Maybe – just MAYBE – the “not fully installed” package is linux-headers-generic, and if I could coax apt-get or dpkg to clean *that* up, we’d rid ourselves of this mess.  [*foreshadowing*  …or maybe I just need to find out how to wipe and reinstantiate Bash on Windows…]
  • First, do the suggested cleanup (sudo apt-get autoremove)
  • Then install debfoster and deborphan
  • Debfoster reports nothing interesting, but deborphan reports:
    deborphan: The status file is in an improper state.
    One or more packages are marked as half-installed, half-configured,
    unpacked, triggers-awaited or triggers-pending. Exiting.
  • This article provides a great grep for isolating the issue – here’s what it uncovered:
    Package: linux-headers-3.13.0-92-generic
    Status: install ok half-configured
    --
    Package: dialog
    Status: install ok unpacked
    --
    Package: debfoster
    Status: install ok unpacked
    --
    Package: deborphan
    Status: install ok unpacked
  • sudo dpkg –audit reports:
    The following packages are only half configured, probably due to problems
    configuring them the first time.  The configuration should be retried using
    dpkg --configure <package> or the configure menu option in dselect:
      linux-headers-3.13.0-92-generic Linux kernel headers for version 3.13.0 on 64
  • We already know “retry” isn’t the answer here…
  • sudo dpkg –configure –pending definitely kicks off the dead-end configuration of the headers…what can cause this to back out, or to remove the stuff that keeps getting triggered?
  • As I was about to uninstall Bash for Ubuntu, I (for no reason) ran exit from within the Bash shell, which showed me this new output:
    mike@MIKE-WIN10-SSD:/var/lib/dpkg$ exit
    exit
    run-parts: waitpid: Interrupted system call
    Failed to process /etc/kernel/header_postinst.d at /var/lib/dpkg/info/linux-headers-3.13.0-92-generic.postinst line 110.
    dpkg: error processing package linux-headers-3.13.0-92-generic (--configure):
    subprocess installed post-installation script returned error exit status 4
    Setting up dialog (1.2-20130928-1) …
  • After a few minutes I just killed the window
  • Restarted Bash, didn’t appear to have made any improvements on my situation.

Perhaps it’s time to finally throw in the towel.

Complete rebuild of Bash on Ubuntu

When all else fails, uninstall and reinstall.  Thankfully I hadn’t invested a ton of real work into this…

According to this comment, the following command cleans up the whole deal:

Lxrun /uninstall /full

(Coda: In case you get an error re-installing afterwards, try running this command again.  I happened to end up with error code 0x80070091 for which I could find no help, but others have reported other error codes too.)

Let’s try this again from scratch.

Hope: I discovered the cbwin project is being actively developed, to enable users of Bash on Ubuntu for Win10 to launch Windows binaries from within the bash environment.  I’ll try this for the vagrant/virtualbox combo and report back.

Update

I quickly ran into limits with cbwin in this particular setup, but seemed to have found peace with a hybrid approach.

Running *nix apps on Win10 “Anniversary update”: initial findings

Ever since Microsoft announced the “Bash on Windows” inclusion in the Anniversary update of Win10, I’ve been positively *itching* to try it out.  I spent *hours* in Git Bash, Cygwin and other workarounds inside Windows to get tools like Vagrant to work natively in Windows.

Spoiler: it never quite worked. [Aside: if anyone has any idea how to get rsync to work in Cygwin or similarly *without* the Bash shell on Windows, let’s talk.  That was the killer flaw.]

Deciphering the (hidden) installation of Bash

I downloaded the update first thing this morning and got it installed, turned on Developer Mode, then…got stumped by Hanselman’s article (above) on how exactly to get the shell/subsystem itself installed.  [Seems like something got mangled in translation, since “…and adding the Feature, run you bash and are prompted to get Ubuntu on Windows from Canonical via the Windows Store…” doesn’t make any grammatical sense, and searching the Windows Store for “ubuntu”, “bash” or “canonical” didn’t turn up anything useful.]

The Windows10 subreddit’s megathread today left incomplete instructions, and a rumour that this was only available on Win10 Pro (not Home).

Instead, it turns out that you have to navigate to legacy control panel to enable, after you’ve turned on Developer Mode (thanks MSDN Blogs):

Control Panel >> Programs >> Turn Windows features on or off, then check “Windows Subsystem for Linux (Beta)”.  Then reboot once again.

WindowsFeaturesDialog

Then fire up CMD.EXE and type “bash” to initiate the installation of “Ubuntu on Windows”:

BashUbuntuWin10Install

Now to use it!

Once installed, it’s got plenty of helpful hints built in (or else Ubuntu has gotten even easier than I remember), such as:

BashUbuntuWin10Hints

Npm, rpm, vagrant, git, ansible, virtualbox are similarly ‘hinted’.

Getting up-to-date software installed

Weirdly, Ansible 1.5.4 was installed, not from the 2.x version.  What gives?  OK, time to chase a rat through a rathole…

This article implies I could try to get a trusty-backport of ansible:
https://blogs.msdn.microsoft.com/commandline/2016/04/06/bash-on-ubuntu-on-windows-download-now-3/

Does that mean the Ubuntu on Windows is effectively an old version of Ubuntu?  How can I even figure that out?

Running ‘apt-get –version’ indicates we have apt 1.0.1ubuntu2 for amb64 compiled on Jan 12 2016.  That seems relatively recent…

Running ‘apt-cache policy ansible’ gives me the following output:

mike@MIKE-WIN10-SSD:/etc/apt$ apt-cache policy ansible 
ansible: 
  Installed: 1.5.4+dfsg-1 
  Candidate: 1.5.4+dfsg-1 
  Version table: 
 *** 1.5.4+dfsg-1 0 
        500 http://archive.ubuntu.com/ubuntu/ trusty/universe amd64 Packages 
        100 /var/lib/dpkg/status

Looking at /etc/apt/sources.list, there’s only three listed by default:

mike@MIKE-WIN10-SSD:/etc/apt$ cat sources.list 
deb http://archive.ubuntu.com/ubuntu trusty main restricted universe multiverse 
deb http://archive.ubuntu.com/ubuntu trusty-updates main restricted universe multiverse 
deb http://security.ubuntu.com/ubuntu trusty-security main restricted universe multiverse

So is there some reason why Ubuntu-on-Windows’ package manager (apt) doesn’t even list > 1.5.4 as an available installation?  ‘Cause I was previously running v2.2.0 of Ansible on native Ubuntu (just last month).

I *could* run from source in a subdirectory from my home directory – but I’m shamefully (blissfully?) unaware of the implications – are there common configuration files that might stomp on each other?  Is there common code stuffed in some dark location that is better left alone?

Or should I add the source repo mentioned here?  That seems the safest option, because then apt should manage the dependencies and not leave me with two installs of ansible (1.5.4 and 2.x).

Turns out the “Latest Releases via Apt (Ubuntu)” seems to have done well enough – now ‘ansible –version’ returns “ansible 2.1.1.0”, which appears to be latest according to https://launchpad.net/~ansible/+archive/ubuntu/ansible.

Deciphering hands-on install dependencies

Next I tried installing vagrant, which went OK, but then complained about an incomplete installation:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ vagrant up

VirtualBox is complaining that the installation is incomplete. Please run
`VBoxManage --version` to see the error message which should contain
instructions on how to fix this error. 

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ VBoxManage --version 
WARNING: The character device /dev/vboxdrv does not exist. 
         Please install the virtualbox-dkms package and the appropriate 
         headers, most likely linux-headers-3.4.0+. 

         You will not be able to start VMs until this problem is fixed.

So, tried ‘sudo apt-get install linux-headers-3.4.0’ and it couldn’t find a match.  Tried ‘apt-cache search linux-headers’ and it came back with a wide array of options – 3.13, 3.16, 3.19, 4.2, 4.4 (many subversions and variants available).

Stopped me dead in my tracks – which one would be appropriate to the Ubuntu that ships as “Ubuntu for Windows” in the Win10 Anniversary Update?  Not that header files *should* interact with the operations of the OS, but on the off-chance that there’s some unexpected interaction, I’d rather be a little methodical than have to figure out how to wipe and reinstall.

Figuring out what is the equivalent “version of Ubuntu” that ships with this subsystem isn’t trivial:

  • According to /etc/issue, it’s “Ubuntu 14.04.4 LTS”.
  • What version of the Linux kernel comes with 14.04.4?
  • According to ‘uname -r’, it’s “3.4.0+”, which seems suspiciously under-specific.
  • According to /proc/version, its “Linux version 3.4.0-Microsoft (Microsoft@Microsoft.com) (gcc version 4.7 (GCC) ) #1 SMP PREEMPT Wed Dec 31  14:42:53 PST 2014”.

That’s enough for one day – custom versions of the OS should make one ponder.  Tune in next time to see what kind of destruction I can wring out of my freshly-unix-ized Windows box.

P.S. Note to self: it’s cool to get an environment running; it’s even better for it to stay up to date.  This dude did a great job of documenting his process for keeping all the packages current.

Goodbye Evernote, you doublespeak snake

I got the note recently from Evernote, much like all my friends, that starts:

At Evernote, we are committed not only to making you as productive as you can be, but also to running our business in as transparent a way as possible. We’re making a change to our Basic service, and it’s important that you know about it.

They go on to drop the bombshell that they’re raising prices on the paid plans, changing feature allocation, and most egregiously, putting new limitations on the Basic (free) accounts. Limiting free users to only having two devices is a serious attack on all those who’ve been working under the assumption that we should continue to invest our memories, data and daily workflow into their platform with the knowledge that we’d always be able to use it everywhere we were.

My PM Perspective

It’s obvious that Evernote’s VCs are putting the screws to them – either that “growth in user signups” is no longer what they’re after, but top-line growth in revenue – or that (more likely) Evernote’s growth in new users is fast waning (the market is now matured), and they need to start farming those millions of naïve suckers for increased revenue.

Either way, the revenue play couldn’t be more obvious. Make it nearly laughably inconvenient for most of their users to use, and at the same time boost the pricing on their subscription plans. Dangerous move kids.

In one way I have hand it to the Product Manager(s) who came up with this scheme – this is the boldest move they could make to convert free users over. I have to believe that they looked at their user base’s # of connected devices before deciding that “2” would be the magic tipping point (i.e. most users have at least three frequently-used devices to use Evernote).

It’s unfortunate that they had to rush the subscription price increase at the same time, however; had they waited another 6 months to let the hordes rush over to their more affordable plans, and *then* raised prices, they might not have caught our ire quite so widely.

Here’s a more winning alternative: convert me to one of the existing paid subscriptions, let me get used to the new normal, settle in to my old habits, let my addiction to the convenience of Evernote at these lower prices take hold again, and *then* ask for a little more money (see Netflix’s recent move to boost streaming prices by a couple of dollars, without changing features *or* the DVD subscription pricing).

I’m gladly giving Netflix the extra dollars, as I currently feel *happy* and agreeable to them, even though I don’t particularly feel like they should need more money for an increasingly-crappy catalogue of non-Netflix movies (which is precisely why I hang on to the DVD subscription).

Had they earned my trust either up to now or through these transitions, I suspect I’d be singing their praises, not migrating out wholesale.

Value Delivery Issues

The obvious value gap – for me – is in what I get back for the price at each subscription level.  At the Plus tier ($34.99/year), there isn’t one feature that I’ve ever actually needed – except the multi-device access (which just pisses me off).  At the Premium tier ($64.99/year), the only feature I would enjoy is the Business Card scanning feature – and as mentioned below, I have learned to mistrust it.

Perhaps I’m an outlier in feeling burned by Evernote, but I sincerely doubt it: there’s been an undercurrent of discontent with Evernote’s products for years:

  • Hello/Food starved out, features added then removed, unceremoniously dumped, and their rich content templates never migrated to Evernote.
  • Business Card scanning feature totally screwy – added to Hello with a “one-year free” promo then silently removed; added to Scannable with another “one-year free” promo then also silently removed from there; template for the scanned data changed at least once that I noticed, and made increasingly less useful *and* less searchable throughout my limited use of it. NOTE: I’ve never had any of these changes communicated to me through their apps, so I don’t know what noob Product/Community Manager thinks they are smoking when they say “running the business in as transparent a way as possible”. I’ve learned not to believe that for a second.
  • Many obvious features requested repeatedly with no acknowledgement let alone commitment (see the forums for a good laugh)
  • An “editor” feature that’s been abysmally poorly written, such that pasting content from Evernote notes is laughably painful for those using that content elsewhere like WordPress
  • A web-based interface that’s tolerable for read-only but horrific for writing – even in Chrome, it can’t seem to auto-save without creating multiple conflicting copies of the same note

And outlier or not, within 24 hours I went out to find the tools to migrate my eight years worth of notes over to OneNote. I’d looked into this a few times over the years, hearing good things Windows types said about OneNote, but never quite believed that it had outgrown its wart-on-the-ass-end-of-Office roots. Given the obvious migration incentive though, I’m willing to give it a try.

OneNote After A Week

So far OneNote has been performant and useable:

  • the migration tool seemed to bring over all the content, though (because of Evernote’s screwy CDATA embedded-HTML formatting) the layout of the content is loose and weirdly-fonted.
  • The difference between Notebooks/embedded notebooks and Notebooks/sections took me a bit (and two tries with the migration so that I got an acceptable arrangement of Tagged content), but otherwise so far so good.
  • I have no evidence that I’ve lost any data, or had any sync conflicts arise. (Unlike Evernote in app or on web)
  • The in-app editor is clean and hasn’t done anything I didn’t want it to do.

Bonus: the web editing experience is seamless.

Tip: if you want full editing power in Windows, use the free Desktop client, not the default “Trusted App” (the Win10/Metro/tablet-oriented app).

Status and failure of CacheMyWork development

Every month or so I get an email like the following:

Hi, Love your app CacheMyWork. Left some suggestions for improvement on your website a while ago. Any chance you’ll release an updated version with more functionality any time soon? Seems to my it has huge potential once it’s updated.

I get this kind of email – wondering when I’ll finish the app, asking about status, wondering why it seems incomplete or why I haven’t integrated [easy fix “X”] – for only one of the open source projects I ever released: http://CacheMyWork.codeplex.com.

Every time I get one of these, I feel like a jerk for not finishing (or continuing) what I started, and I realize that if I was just a better coder, I’d have a lot of happy people out there.

I could never get my head around the databinding that is necessary to connect the WPF front end i built for the “version 2” of my app to the well-formed, totally functional app-finding algorithm I built literally years ago.

I feel like I’m disappointing a lot of people by not getting this back underway, but I really don’t have a clue how to fix the damned thing. I think my failure to maintain/improve this project is one reason why I’m giving up on my dream to ever be a professional coder.

I swear something must be wrong with my brain – every time I try to re-learn databinding concepts for .NET they look like they’re pretty simple for the author, but when I try to apply the ideas to my code, it never seems to work. I’ve coded three different data classes, I’ve tried every combination of parameters in the binding (both the XAML and the code-behind) I could find, but at best I get code parameters in the UI – never anything that hints that the bound data is leaking through (even though I can clearly see the data in the data class when I set breakpoints in the debugger). It’s like I’m not “getting” something about how this is supposed to work – it reminds me of how I was a week or two behind in introductory calculus class, when my brain couldn’t visualize what it was we were manipulating with those damned equations. (I finally got the calculus, though I think by now I’d have to start all over again.)

What would you do in this situation? I’d really like to get this going again – at least make good on the unfinished “new” release, and give myself some closure on that chapter of my geek life.

Lenovo T61? High CPU usage? Look for bad drivers (…spewing hardware interrupts)

For a couple of months, I noticed some pretty unusual CPU utilization on my Lenovo T61 Thinkpad (running Windows XP SP3).  I run Process Explorer (ProcExp) in the background all day, to keep an eye out for unusual system performance behaviour, and lately I’ve noticed that almost an entire “core” on my Intel dual-core system is consumed by something running in Kernel space.

notificationAreaI have ProcExp configured to display the CPU meter in the “systray” (aka notification area), and it’s really easy to know when there’s a kernel-consumer on your system – that’s when the CPU utilization graph will show a sizeable proportion of the line graph in red, rather than its more typical green (for User space consumers).

procexp Pop open ProcExp under this condition, and if you’ve spent as much time watching these behaviour profiles as I have, and you’ll be as surprised as I was to find the “Interrupts” process (really, just an abstract aggregate) reporting CPU utilization in the 30’s or 40’s.

 

Summary of the solution: bad drivers, needed to be updated.  In my case it was a network driver that had just been updated by the vendor; for others the culprits seem to be video drivers.  I’m sure other failures exist, but these are the two most common.

Step 1: Device Manager

When hardware and/or drivers are acting up like this, check to make sure your drivers are loading properly in Windows:

  • Open the Device Manager control panel applet (e.g. run “devmgmt.msc”)
  • Look at the list to see if any categories are expanded
  • In every expanded category, examine the details of any device that has a yellow exclamation mark (indicating there is some problem with the driver).  If at all possible, resolve the issue that’s been identified.

Until all driver issues have been resolved, any other suggestions you follow below could be moot.  If you really can’t fix the problem(s) that Device Manager is presenting, but want to try other approaches, then there are a couple of tactical workarounds you may wish to try:

  1. Install a newer (or older) version of the driver, to see if it’s just the current driver version that’s having problems.
  2. Disable the device temporarily, and make sure you are still seeing the excessive hardware interrupts being generated for a sustained period.  (If not, then perhaps that was the offending driver, and perhaps you’ll be able to run with the device disabled while the vendor addresses the issue in a later version of the driver.  Lucky you.)

Step 2: KernRate

Microsoft has an interesting little package, developed for use on Windows XP, to dig into a little detail on the Kernel-related activity on the system.  It’s not exactly a wizard I’d send to my mom, but it’s helped at least narrow the potential issues to track down for me.

You can download KernRate from here.

2.1 Setup & Config for a valid KernRate result

  • Install Windows Debugging Tools
  • Install KernRate
  • Configure valid symbols PATH variable – see this thread for details
    • Create a new directory C:\Symbols
    • Create a new environment variable by clicking on: Start menu > Control Panel > System > Advanced > Environment Variables > click New under “user variables for …”, then type:
      • Variable name: _NT_SYMBOL_PATH
      • Variable value:
  • Copy DBGHELP.DLL and SYMSRV.DLL from the install location for Windows Debugging Tools (usually C:\Program Files\Debugging Tools for Windows (x86)) to the install location for KernRate (usually C:\Program Files\KrView\Kernrates)

2.2 Use KernRate to profile kernel activity

Once it’s installed and configured correctly, you can just fire up Kernrate_i386_xp.exe, let it run for a few minutes (during the period when the Kernel activity is not what you’d expect), and then hit [Ctrl]-C to get the results of the Kernrate run.

Here’s a typical output from my system during normal operating conditions  (NOTE: for every Kernrate output, I’ve left out the entries from the final table that have no significant “hits” i.e. no significant volume of kernel events):

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe
/==============================\

\==============================/
Date: 2009/06/09   Time: 13:50:38
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:11.296 (17.7%)  U 0:00:06.171 ( 9.7%)  I 0:00:46.328 (72.6%)  DPC
0:00:00.078 ( 0.1%)  Interrupt 0:00:00.031 ( 0.0%)
       Interrupts= 76929, Interrupt Rate= 1206/sec.

P1     K 0:00:12.328 (19.3%)  U 0:00:07.718 (12.1%)  I 0:00:43.750 (68.6%)  DPC
0:00:00.109 ( 0.2%)  Interrupt 0:00:00.531 ( 0.8%)
       Interrupts= 15687, Interrupt Rate= 246/sec.

TOTAL  K 0:00:23.625 (18.5%)  U 0:00:13.890 (10.9%)  I 0:01:30.078 (70.6%)  DPC
0:00:00.187 ( 0.1%)  Interrupt 0:00:00.562 ( 0.4%)
       Total Interrupts= 92616, Total Interrupt Rate= 1452/sec.

Total Profile Time = 63796 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       184655872,       180600832,        -4055
040
    Available Pagefile(s)       ,      1075990528,      1069088768,        -6901
760
    Available Virtual           ,      2132107264,      2131058688,        -1048
576
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,       949736,         14887/sec.
    System Calls         ,      2754324,         43173/sec.
    Page Faults          ,       799583,         12533/sec.
    I/O Read Operations  ,         5382,         84/sec.
    I/O Write Operations ,         4657,         73/sec.
    I/O Other Operations ,        92064,         1443/sec.
    I/O Read Bytes       ,     15853022,         2946/ I/O
    I/O Write Bytes      ,      1968451,         423/ I/O
    I/O Other Bytes      ,      9179346,         100/ I/O

—————————–

Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 229
Percentage in the following table is based on the Total Hits for the Kernel

Time   11604 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               9041      63796    77 %     3542933
hal                                    1155      63796     9 %      452614
win32k                                  604      63796     5 %      236691
mfehidk                                 179      63796     1 %       70145
intelppm                                163      63796     1 %       63875
igxpmp32                                112      63796     0 %       43889
HIPK                                     98      63796     0 %       38403 
……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

And here’s the output from my system during one of these CPU over-consumption cases:

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe
/==============================\

\==============================/
Date: 2009/06/10   Time:  8:41:26
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:02:26.906 (80.8%)  U 0:00:21.640 (11.9%)  I 0:00:13.328 ( 7.3%)  DPC
0:00:01.078 ( 0.6%)  Interrupt 0:01:18.406 (43.1%)
       Interrupts= 3919032, Interrupt Rate= 21548/sec.

P1     K 0:02:14.000 (73.7%)  U 0:00:32.703 (18.0%)  I 0:00:15.171 ( 8.3%)  DPC
0:00:00.640 ( 0.4%)  Interrupt 0:01:02.890 (34.6%)
       Interrupts= 3837550, Interrupt Rate= 21100/sec.

TOTAL  K 0:04:40.906 (77.2%)  U 0:00:54.343 (14.9%)  I 0:00:28.500 ( 7.8%)  DPC
0:00:01.718 ( 0.5%)  Interrupt 0:02:21.296 (38.8%)
       Total Interrupts= 7756582, Total Interrupt Rate= 42648/sec.

Total Profile Time = 181875 msec

                                       BytesStart          BytesStop         By
esDiff.
    Available Physical Memory   ,       564551680,       571240448,         668
768
    Available Pagefile(s)       ,      2194202624,      2233167872,        3896
248
    Available Virtual           ,      2132402176,      2131353600,        -104
576
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,      1764882,         9704/sec.
    System Calls         ,     14821228,         81491/sec.
    Page Faults          ,       188084,         1034/sec.
    I/O Read Operations  ,        31768,         175/sec.
    I/O Write Operations ,        27985,         154/sec.
    I/O Other Operations ,       321733,         1769/sec.
    I/O Read Bytes       ,     24755975,         779/ I/O
    I/O Write Bytes      ,     14122756,         505/ I/O
    I/O Other Bytes      ,     31150904,         97/ I/O

—————————–
Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   62240 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                              40159     181875    64 %     5520137
hal                                   10990     181875    17 %     1510652
igxpmp32                               2381     181875     3 %      327285
HECI                                   2170     181875     3 %      298281
win32k                                 1495     181875     2 %      205498
intelppm                               1335     181875     2 %      183505
mfehidk                                 931     181875     1 %      127972
HIPK                                    660     181875     1 %       90721
Ntfs                                    361     181875     0 %       49621
usbuhci                                 285     181875     0 %       39175
PGPwded                                 223     181875     0 %       30652

……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================
 

 

2.3 KernRate detailed profiling on individual modules

So what I’m seeing is (like many other folks on the ‘Net) that the NTKRNLPA module is consuming the lions’ share of the CPU (being consumed by Kernel-space threads).  Following some of the threads on the Sysinternals message boards, I dug into the details of what images were being loaded by NTKRNLPA during these kernel events:

2.3.1 NTKRNLPA

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -s 5 -z ntkrnlpa
/==============================\

\==============================/
Date: 2009/06/10   Time:  8:49:42
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -s 5 -z ntkrnlpa

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for 804d7000 \WINDOWS\system32\ntkrnlpa.exe

Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:01.984 (39.7%)  U 0:00:00.406 ( 8.1%)  I 0:00:02.609 (52.2%)  DPC
0:00:00.031 ( 0.6%)  Interrupt 0:00:01.421 (28.4%)
       Interrupts= 97650, Interrupt Rate= 19530/sec.

P1     K 0:00:02.125 (42.5%)  U 0:00:00.453 ( 9.1%)  I 0:00:02.421 (48.4%)  DPC
0:00:00.062 ( 1.3%)  Interrupt 0:00:01.406 (28.1%)
       Interrupts= 95410, Interrupt Rate= 19082/sec.

TOTAL  K 0:00:04.109 (41.1%)  U 0:00:00.859 ( 8.6%)  I 0:00:05.031 (50.3%)  DPC
0:00:00.093 ( 0.9%)  Interrupt 0:00:02.828 (28.3%)
       Total Interrupts= 193060, Total Interrupt Rate= 38612/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       563130368,       560410624,        -2719
744
    Available Pagefile(s)       ,      2230300672,      2226749440,        -3551
232
    Available Virtual           ,      2132402176,      2104180736,       -28221
440
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        73005,         14601/sec.
    System Calls         ,       214633,         42927/sec.
    Page Faults          ,         3814,         763/sec.
    I/O Read Operations  ,         1175,         235/sec.
    I/O Write Operations ,          433,         87/sec.
    I/O Other Operations ,         4569,         914/sec.
    I/O Read Bytes       ,       776604,         661/ I/O
    I/O Write Bytes      ,       129278,         299/ I/O
    I/O Other Bytes      ,       575190,         126/ I/O

—————————–
Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1762 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1307       4984    74 %     6555979
hal                                     246       4984    13 %     1233948
HECI                                     44       4984     2 %      220706
igxpmp32                                 38       4984     2 %      190609
intelppm                                 36       4984     2 %      180577
win32k                                   33       5000     1 %      165000
mfehidk                                  15       5000     0 %       75000
HIPK                                     12       5000     0 %       60000
……

===> Processing Zoomed Module ntkrnlpa.exe…

—– Zoomed module ntkrnlpa.exe (Bucket size = 16 bytes, Rounding Down) ——-

Percentage in the following table is based on the Total Hits for this Zoom Module

Time   1307 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
KiDispatchInterrupt                     787       4984    46 %     3947632
READ_REGISTER_USHORT                    383       4984    22 %     1921147
READ_REGISTER_ULONG                     342       4984    20 %     1715489
READ_REGISTER_UCHAR                      41       4984     2 %      205658
KeReleaseInStackQueuedSpinLockFromDpcLevel         13       4984     0 %       65208
ZwYieldExecution                         10       4984     0 %       50160

……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================
 

2.3.2 HAL

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -v -s 5 -z hal
/==============================\

\==============================/
Date: 2009/06/10   Time: 18:25:00
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -v -s 5 -z hal

KERNRATE: current IMAGEHLP SymOptions: UNDNAME DEBUG

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
KERNRATE: IMAGEHLP symbol search path is: .;symsrv*symsrv.dll*c:\symbols*
http://
msdl.microsoft.com/download/symbols;C:\WINDOWS\System32\Drivers;C:\WINDOWS\System32;C:\WINDOWS
CallBack: Loading symbols for hal.dll…
DBGHELP: No header for hal.dll.  Searching for image on disk
DBGHELP: C:\Program Files\KrView\Kernrates\hal.dll – file not found
DBGHELP: hal.dll not found in .
DBGHELP: hal.dll not found in C:\WINDOWS\System32\Drivers
DBGHELP: C:\WINDOWS\System32\hal.dll – OK
DBGHELP: C:\WINDOWS\System32\hal.dll found
DBGHELP: .\halmacpi.pdb – file not found
DBGHELP: .\dll\halmacpi.pdb – file not found
DBGHELP: .\symbols\dll\halmacpi.pdb – file not found
CallBack: Finished Attempt to Load symbols for 806e4000 \WINDOWS\system32\hal.dll

DBGHELP: hal – public symbols
         c:\symbols\halmacpi.pdb\9875FD697ECA4BBB8A475825F6BF885E1\halmacpi.pdb
Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:02.531 (50.6%)  U 0:00:00.453 ( 9.1%)  I 0:00:02.015 (40.3%)  DPC
0:00:00.000 ( 0.0%)  Interrupt 0:00:01.890 (37.8%)
       Interrupts= 119072, Interrupt Rate= 23814/sec.

P1     K 0:00:02.609 (52.2%)  U 0:00:00.656 (13.1%)  I 0:00:01.734 (34.7%)  DPC
0:00:00.031 ( 0.6%)  Interrupt 0:00:01.437 (28.7%)
       Interrupts= 114272, Interrupt Rate= 22854/sec.

TOTAL  K 0:00:05.140 (51.4%)  U 0:00:01.109 (11.1%)  I 0:00:03.750 (37.5%)  DPC
0:00:00.031 ( 0.3%)  Interrupt 0:00:03.328 (33.3%)
       Total Interrupts= 233344, Total Interrupt Rate= 46669/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       111005696,       117772288,         6766
592
    Available Pagefile(s)       ,      1635250176,      1639911424,         4661
248
    Available Virtual           ,      2132402176,      2131013632,        -1388
544
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        82262,         16452/sec.
    System Calls         ,       292287,         58457/sec.
    Page Faults          ,         6431,         1286/sec.
    I/O Read Operations  ,         1592,         318/sec.
    I/O Write Operations ,          506,         101/sec.
    I/O Other Operations ,         7687,         1537/sec.
    I/O Read Bytes       ,       344208,         216/ I/O
    I/O Write Bytes      ,       457023,         903/ I/O
    I/O Other Bytes      ,       951393,         124/ I/O

—————————–

Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1761 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1223       5000    69 %     6115000
hal                                     258       5000    14 %     1290000
intelppm                                 74       5000     4 %      370000
HECI                                     55       5000     3 %      275000
win32k                                   32       5000     1 %      160000
igxpmp32                                 29       5000     1 %      145000
HIPK                                     22       5000     1 %      110000
mfehidk                                  15       5000     0 %       75000
……

===> Processing Zoomed Module hal.dll…

—– Zoomed module hal.dll (Bucket size = 16 bytes, Rounding Down) ——–
Percentage in the following table is based on the Total Hits for this Zoom Module

Time   258 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
READ_PORT_UCHAR                         143       5000    25 %      715000
READ_PORT_ULONG                         139       5000    24 %      695000
READ_PORT_USHORT                        139       5000    24 %      695000
HalpQueryPerformanceCounter              24       5000     4 %      120000
KeReleaseQueuedSpinLock                  21       5000     3 %      105000
KeAcquireInStackQueuedSpinLock           21       5000     3 %      105000
KfLowerIrql                               8       5000     1 %       40000
ExAcquireFastMutex                        7       5000     1 %       35000
KeGetCurrentIrql                          7       5000     1 %       35000
HalpGetCmosCenturyByte                    7       5000     1 %       35000
HalpReleaseSystemHardwareSpinLock          7       5000     1 %       35000
HalBeginSystemInterrupt                   6       5000     1 %       30000
KeAcquireQueuedSpinLock                   5       5000     0 %       25000
……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

2.3.3 IGXPMP32

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -v -s 5 -z igxpmp32
/==============================\

\==============================/
Date: 2009/06/10   Time: 18:25:11
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -v -s 5 -z igxpmp32

KERNRATE: current IMAGEHLP SymOptions: UNDNAME DEBUG

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
KERNRATE: IMAGEHLP symbol search path is: .;symsrv*symsrv.dll*c:\symbols*
<a href="http://
“>“>”>”>”>”>”>http://
msdl.microsoft.com/download/symbols;C:\WINDOWS\System32\Drivers;C:\WINDOWS\System32;C:\WINDOWS
CallBack: Loading symbols for igxpmp32.sys…
DBGHELP: No header for igxpmp32.sys.  Searching for image on disk
DBGHELP: C:\Program Files\KrView\Kernrates\igxpmp32.sys – file not found
DBGHELP: igxpmp32.sys not found in .
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.sys – OK
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.sys found
DBGHELP: .\igxpmp32.pdb – file not found
DBGHELP: .\sys\igxpmp32.pdb – file not found
DBGHELP: .\symbols\sys\igxpmp32.pdb – file not found
SYMSRV:  The server name or address could not be resolved
SYMSRV:  The server name or address could not be resolved
SYMSRV:  c:\symbols\igxpmp32.pdb\01C98FF4DD054B6AB11509CF9CF04F171\igxpmp32.pdb
not found
SYMSRV: 
http://msdl.microsoft.com/download/symbols/igxpmp32.pdb/01C98FF4DD054B6
AB11509CF9CF04F171/igxpmp32.pdb not found
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\symbols\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\symbols\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\symbols\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.pdb – file not found
DBGHELP: D:\ccViews\autobuild1_BR-0810-1787_14.36_Snapshot\gfx_Development\dump3
2\igfx\xp\gtune\AIM3Lib\igxpmp32.pdb – file not found
CallBack: Finished Attempt to Load symbols for b942f000 \SystemRoot\system32\DRI
VERS\igxpmp32.sys

DBGHELP: igxpmp32 – export symbols
Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:02.812 (56.1%)  U 0:00:00.265 ( 5.3%)  I 0:00:01.937 (38.6%)  DPC
0:00:00.000 ( 0.0%)  Interrupt 0:00:01.812 (36.1%)
       Interrupts= 114753, Interrupt Rate= 22879/sec.

P1     K 0:00:02.937 (58.8%)  U 0:00:00.562 (11.3%)  I 0:00:01.500 (30.0%)  DPC
0:00:00.062 ( 1.3%)  Interrupt 0:00:01.578 (31.6%)
       Interrupts= 109952, Interrupt Rate= 21990/sec.

TOTAL  K 0:00:05.750 (57.4%)  U 0:00:00.828 ( 8.3%)  I 0:00:03.437 (34.3%)  DPC
0:00:00.062 ( 0.6%)  Interrupt 0:00:03.390 (33.9%)
       Total Interrupts= 224705, Total Interrupt Rate= 44941/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       119107584,       107962368,       -11145
216
    Available Pagefile(s)       ,      1640800256,      1629646848,       -11153
408
    Available Virtual           ,      2132402176,      2102800384,       -29601
792
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        78959,         15792/sec.
    System Calls         ,       301466,         60293/sec.
    Page Faults          ,         8704,         1741/sec.
    I/O Read Operations  ,          711,         142/sec.
    I/O Write Operations ,          197,         39/sec.
    I/O Other Operations ,         5430,         1086/sec.
    I/O Read Bytes       ,      1212541,         1705/ I/O
    I/O Write Bytes      ,        14701,         75/ I/O
    I/O Other Bytes      ,      1486190,         274/ I/O

—————————–

Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1764 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1232       5000    69 %     6160000
hal                                     235       5000    13 %     1175000
intelppm                                 93       5015     5 %      463609
HECI                                     42       5015     2 %      209371
win32k                                   34       5000     1 %      170000
igxpmp32                                 32       5015     1 %      159521
mfehidk                                  23       5000     1 %      115000
HIPK                                     21       5000     1 %      105000
kmixer                                   11       5000     0 %       55000
KSecDD                                   10       5015     0 %       49850
……

===> Processing Zoomed Module igxpmp32.sys…

—– Zoomed module igxpmp32.sys (Bucket size = 16 bytes, Rounding Down) ——-

Percentage in the following table is based on the Total Hits for this Zoom Module

Time   32 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
xc_copp_agent                            30       5015    100 %      149551

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

2.3.4 HECI

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -v -s 5 -z heci
/==============================\

\==============================/
Date: 2009/06/10   Time: 18:48:17
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -v -s 5 -z heci

KERNRATE: current IMAGEHLP SymOptions: UNDNAME DEBUG

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
KERNRATE: IMAGEHLP symbol search path is: .;symsrv*symsrv.dll*c:\symbols*
<a href="http://
“>“>”>”>”>”>”>http://
msdl.microsoft.com/download/symbols;C:\WINDOWS\System32\Drivers;C:\WINDOWS\System32;C:\WINDOWS
CallBack: Loading symbols for HECI.sys…
DBGHELP: No header for HECI.sys.  Searching for image on disk
DBGHELP: C:\Program Files\KrView\Kernrates\HECI.sys – file not found
DBGHELP: HECI.sys not found in .
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.sys – OK
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.sys found
DBGHELP: .\HECI.pdb – file not found
DBGHELP: .\sys\HECI.pdb – file not found
DBGHELP: .\symbols\sys\HECI.pdb – file not found
SYMSRV:  The server name or address could not be resolved
SYMSRV:  The server name or address could not be resolved
SYMSRV:  c:\symbols\HECI.pdb\0A57FC20FA904FB8BAEC77C53EE39C821\HECI.pdb not found
SYMSRV: 
http://msdl.microsoft.com/download/symbols/HECI.pdb/0A57FC20FA904FB8BAE
C77C53EE39C821/HECI.pdb not found
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\symbols\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\symbols\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\symbols\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.pdb – file not found
DBGHELP: d:\ccviews\autobuild1_br-0707-1049_2.6_snapshot\amt_development\sw\src\
drivers\heci\driver\objfre_wxp_x86\i386\HECI.pdb – file not found
CallBack: Finished Attempt to Load symbols for ba308000 \SystemRoot\system32\DRI
VERS\HECI.sys

DBGHELP: HECI – no symbols loaded
Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:02.296 (45.9%)  U 0:00:00.343 ( 6.9%)  I 0:00:02.359 (47.2%)  DPC
0:00:00.000 ( 0.0%)  Interrupt 0:00:01.359 (27.2%)
       Interrupts= 115350, Interrupt Rate= 23070/sec.

P1     K 0:00:02.203 (44.1%)  U 0:00:00.531 (10.6%)  I 0:00:02.265 (45.3%)  DPC
0:00:00.015 ( 0.3%)  Interrupt 0:00:01.093 (21.9%)
       Interrupts= 110550, Interrupt Rate= 22110/sec.

TOTAL  K 0:00:04.500 (45.0%)  U 0:00:00.875 ( 8.8%)  I 0:00:04.625 (46.3%)  DPC
0:00:00.015 ( 0.2%)  Interrupt 0:00:02.453 (24.5%)
       Total Interrupts= 225900, Total Interrupt Rate= 45180/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       116039680,       114077696,        -1961
984
    Available Pagefile(s)       ,      1642790912,      1640251392,        -2539
520
    Available Virtual           ,      2132402176,      2105364480,       -27037
696
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        91187,         18237/sec.
    System Calls         ,       270278,         54056/sec.
    Page Faults          ,         6482,         1296/sec.
    I/O Read Operations  ,          269,         54/sec.
    I/O Write Operations ,           41,         8/sec.
    I/O Other Operations ,         5361,         1072/sec.
    I/O Read Bytes       ,      1551742,         5769/ I/O
    I/O Write Bytes      ,         2836,         69/ I/O
    I/O Other Bytes      ,      2381876,         444/ I/O

—————————–
Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1754 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1273       5000    72 %     6365000
hal                                     222       5000    12 %     1110000
intelppm                                 70       5000     3 %      350000
HECI                                     47       5000     2 %      235000
igxpmp32                                 34       5000     1 %      170000
win32k                                   30       5000     1 %      150000
mfehidk                                  14       5000     0 %       70000
HIPK                                     12       5000     0 %       60000
KSecDD                                   12       5000     0 %       60000
……

===> Processing Zoomed Module HECI.sys…

Symbol Enumeration failed (or no symbols found) on module HECI in CreateZoomedModuleList, Error Code= 0x000001e7
No Hits or No symbols found for module HECI.sys
================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

 

Step 3: Analysis & Research

In my case, it appears that there’s some process (or processes) that’s calling the RtlIpv6StringToAddressA kernel API an excessive number of times.  From what I’ve seen in my research, not only is this a common complaint, but it’s been reported by at least two other T61 users in the last year or so – without any definitive solution.

And what’s this RtlIpv6StringToAddressA API do exactly?  Well, it translates a string form of an IPv6 address to a more strongly-typed form of the address.  Here’s what I know (and what I don’t) about IPv6 in my T61 and Windows XP SP3:

  • The “teredo” driver hasn’t been installed in Windows XP SP3 (“teredo” is the code name used by the Microsoft Windows team that developed the IPv6 stack)
  • There are a number of third-party packages on my system, but the newest to have been introduced are:
    • the AMT 2.6-supporting software and driver (v “2.6.30.1014 built by: WinDDK”, driver date 7/12/2007)
    • the PGP whole disk encryption package
    • the Intel 4965AGN wireless driver (v12.4.0.21), installed earlier
    • the Intel PROSet/Wireless Wifi Software (v12.1.1.0), installed earlier
  • Knowing that AMT is meant to provide a remote networking interface to Intel chipset functionality, and knowing that this problem didn’t occur before the AMT & PGP installs took place, my primary suspect (as the source of these excessive IPv6-related API calls) becomes the AMT software.
  • However, I consulted with colleagues on my team that are intimately familiar with the internals of HECI.sys, and they have confirmed that it doesn’t call the RtlIpv6StringToAddressA() Win32 API
  • That prompted me to search the strings embedded in all the drivers and libraries I could find in C:\Windows\System32.  Here were the files where FINDSTR.EXE could find that API call:
    • dnsapi.dll
    • mswsock.dll
    • ntdll.dll
    • ntkrnlpa.exe
    • ntoskrnl.exe
    • Drivers\http.sys

Disabled the following Devices (which didn’t help):

  • Ricoh Memory Stick Controller
  • Ricoh SD/MMC Host Controller
  • Ricoh xD-Picture Card Controller
  • OHCI Compliant IEEE 1394 Host Controller
  • ThinkPad Modem
  • Ricoh R/RL/5C476(II) or Compatible CardBus Controller
  • SDA Standard Compliant SD Host Controller

Downloaded and installed the following driver updates:

  • bluetooth v5.5.09.6400
  • Ultranav v11.1.21.2
  • Power Manager v1.60
  • ACPI Power Management v1.53
  • Integrated camera v5.8.49004.0
  • Hotkey driver v2.20.0001

Step 4: Make sure you try later and earlier driver versions

Bottom line: I replaced the network driver on my system with one that appeared *months* after I started down this road, and I finally stopped seeing these CPU spews occur.

I was lucky – the vendor dropped a new driver version that happened to resolve this error.  Until that point, I was stuck – nothing I could do to my OS to resolve it.

And if your vendor hasn’t released a *later* driver that eliminates the issue, I’ll recommend something radical – try an *earlier* version of the driver (e.g. like these folks here).  Sometimes programmers introduce errors that didn’t exist in previous versions of their code, even if they didn’t intend to affect such a broad issue.  (Especially when they didn’t intend to.)  You might find a 6-, 12- or 24-month-old driver that still works well enough (for the time being) and doesn’t exhibit the performance drag you’re seeing.  What’s worse – an older driver that works but doesn’t have the latest features, or a driver that has the features but also craters your system performance?

Last resort: you may find that disabling the device in the OS will finally shut up a misbehaving driver (i.e. that the driver won’t load, so it won’t spawn all these excessive interrupts).  This isn’t a great long-term solution, and it’s only viable for hardware you don’t need to use (e.g. a biometric device such as fingerprint reader), but it’s one last thing to consider before junking your whole system.

Notes: related articles

Sysinternals Forums

  1. Hardware interrupts using excessive CPU
  2. Kernrate trace + RtlIpv6StringToAddressA (?)

Lenovo Forums

  1. x200 Hardware Interrupts constantly at 40% of CPU resources
  2. X200: hardware interrupts causes cpu load?
  3. T61: Problems with Hardware Interrupts when on battery
  4. Problem on S12 with flood of interrupts from Broadcom wireless-G after Windows resume

DLL Injection in Windows: what security countermeasures can you use?

Manage the Administrators group

Examine any default install of Windows since NT4 SP6.  You’ll notice that the SeDebugPrivilege is assigned by default only to the .\Administrators local group of the Windows host.  While this isn’t exactly unusual for users to be members of Administrators on their own PC, don’t think that every user or process automatically gets this capability in Windows.

cuffedCountermeasure: If you want to assert an explicit distinction between those who do and do not have the SeDebugPrivilege on a Windows system, explicitly manage the membership of the Administrators local group.  This is especially useful (and applicable) to Windows Servers, where most of your users won’t have (or have need for) this membership.

How to implement:

  • run the net localgroup command locally e.g. with these parameters: “net localgroup Administrators NAME_OF_USER_OR_GROUP_TO_REMOVE /Delete” (or run it remotely via a remote-shell tool such as psexec.exe)
  • configure a Group Policy (e.g. using Active Directory group policy) that sets the membership of the Administrators group using the Restricted Groups security setting (either overwriting the existing membership or incrementally adding/deleting specified security principals)

Manage the SeDebugPrivilege

The obvious flipside of the default SeDebugPrivilege assignment is that you can change the security principals to whom the privilege is assigned.  In fact, if you review (or have implemented) the Microsoft Security Security Accelerators for Windows (Windows XP, Windows Server 2003, Windows Vista, Windows Server 2008), you’ll find they recommend

Countermeasure: remove the Administrators group from the SeDebugPrivilege.

How to implement:

  • run the ntrights.exe Resource Kit command-line tool locally e.g. with these parameters: “ntrights.exe –u Administrators –r SeDebugPrivilege” (or run it remotely via a remote-shell tool such as psexec.exe)
  • configure a Group Policy (e.g. using Active Directory group policy) that removes all security principals that are assigned the SeDebugPrivilege privilege

Run Apps, Services as lesser-privileged user

So the first two BKMs are great and all, but there are still lots of situations where you can’t make these blanket changes to the entire OS (though thankfully virtualization is reducing these “shared system” problems).  You may have to find ways to launch one or more processes with different security context or privileges than the rest of the system – sometimes having to run something with more privilege than the rest of the system (e.g. try Sudo for Windows), but usually wanting to strip privilege and permissions away from specific processes.

jailbirdCountermeasure: use Windows’ Software Restriction Policy (aka SRP or “SAFER”) to strip the Token of as many groups and privileges as the application can tolerate.  You don’t have to set a restrictive policy for the whole system – you can set this on an application-by-application basis (which can be practical in server environments, where you may only have a few critical applications to have to protect from each other).

How to implement:

  • Download and use Michael Howard’s SetSAFER application, which will strip varying levels of privileges and groups from the security token assigned to the process (thus making it more difficult for the process to access privileged objects in Windows).  If you want to dig into the code for this, and if the source code isn’t available, you can take a look at the code included in the original article (on which SetSAFER was based), or fire up DotNet Reflector and inspect the MSIL for the SetSAFER “executable”.
  • You could also try “psexec –l” (which implements one of the approaches taken by SetSAFER – one of the “stripped-down profiles”).

 

Something about this feels like I’ve missed another approach that should be mentioned in this context, but I’m sure there’s smarter folks than I reading this who can add any missing details to the picture.  Thanks, and have fun with this!