Windows pre-WSL: fighting with vagrant, virtualbox and Cygwin

These are notes I kept for myself while struggling to find a way to use Vagrant and Virtualbox (among other tools like git, ansible and ssh) to test and debug virtual server builds without having to dual-boot into a Linux box.
Note to DavidP: you’ll see throughout the rsync errors that made this whole thing not *quite* a functional vagrant environment.
Obligatory: “this was of course a foolish mission that many loudly warned me against”.
Obligatory: “I thought I was smarter than most, and could probably decipher the grotty issues” arrogance.
Futile: “I hope these notes help someone else somewhere down the line with a similar issue”.
Important: my work was liberally stealing great ideas from Stefan van Essen:

Basic Setup

Details of what I did to make sure I can recreate my environment later:
  • Installed Cygwin64 with tools like bash, curl, gcc-core, make, nano, openssh, openssl, python, wget (and keep uninstalling python-cryptography because it interferes with my almost-working “ansible-under-cygwin” setup)
  • Installed apt-cyg to make cygwin package management easier
  • Installed VirtualBox for Windows with default paths (leaving the VMs in %username%/VirtualBox VMs)
  • Installed Vagrant for Windows with no changes
  • Created a symlink in the cygwin home directory to the VMs for ease of access (ln -s “/cygdrive/c/Users/Mike/VirtualBox VMs” ~/VMs)

First sign of trouble, selection of build target

  • Problem: “vagrant up” runs into a couple of problems consistently, which might have something to do with me instantiating the vagrant box under Windows (where there wasn’t an available rsync program to enable Vagrant to do whatever it needs when first “up”ing a new Vagrant box)
    • There seemed to be an SSH issue: 
      ==> default: Waiting for machine to boot. This may take a few minutes...
         default: SSH address: 127.0.0.1:2222
         default: SSH username: vagrant
         default: SSH auth method: private key
         default: Warning: Remote connection disconnect. Retrying...
         default: Warning: Remote connection disconnect. Retrying...
      ==> default: Machine booted and ready!
    • Then appeared a “hangover” rsync issue:
      ==> default: Rsyncing folder: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ => /vagrant
      There was an error when attempting to rsync a synced folder.
      Please inspect the error message below for more info.
      Host path: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/
      Guest path: /vagrant
      Command: rsync --verbose --archive --delete -z --copy-links --chmod=ugo=rwX --no-perms --no-owner --no-group --rsync-path sudo rsync -e ssh -p 2222 -o ControlMaster=auto -o ControlPath=C:/cygwin64/tmp/ssh.570 -o ControlPersist=10m -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i 'C:/Users/Mike/VirtualBox VMs/BaseDebian/.vagrant/machines/default/virtualbox/private_key' --exclude .vagrant/ /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ vagrant@127.0.0.1:/vagrant
      Error: Warning: Permanently added '[127.0.0.1]:2222' (ECDSA) to the list of known hosts.
      mm_receive_fd: no message header
      process_mux_new_session: failed to receive fd 0 from slave
      mux_client_request_session: read from master failed: Connection reset by peer
      Failed to connect to new control master
      rsync: connection unexpectedly closed (0 bytes received so far) [sender]
      rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.2]
    • I figured rather than treat this box as a pet, I’d just destroy it and try over again [good practice for breaking my old habits of trying to root cause catastrophic, but sometimes irreproducible, issues]
    • But no – these issues reappear for a freshly-init’d instance of the same vagrant box (debian/jessie64)
    • However, the next vagrant box I tried (hashicorp/precise64) (where I’ll put my dev tools) seemed close enough to fine, no red errors:
      ==> default: Machine booted and ready!
      ==> default: Checking for guest additions in VM...
         default: The guest additions on this VM do not match the installed version of
         default: VirtualBox! In most cases this is fine, but in rare cases it can
         default: prevent things such as shared folders from working properly. If you see
         default: shared folder errors, please make sure the guest additions within the
         default: virtual machine match the version of VirtualBox you have installed on
         default: your host and reload your VM.
         default:
         default: Guest Additions Version: 4.2.0
         default: VirtualBox Version: 5.0
      ==> default: Mounting shared folders...
         default: /vagrant => C:/Users/Mike/VirtualBox VMs/UbuntuDev
    • I also tried with (box-cutter/ubuntu1404-desktop) – worked great!
      ==> default: Machine booted and ready!
      ==> default: Checking for guest additions in VM...
      ==> default: Mounting shared folders...
         default: /vagrant => C:/Users/Mike/VirtualBox VMs/Ubuntu-box-cutter
    • Conclusion: using the box-cutter/ubuntu1404 box for my dev work

Configuration woes

  • Installed git (with suggested additions)
  • Installed ansible
  • Generated ssh keys, ssh-copy-id copied to mike@192.168.1.14 (debian1sttry), then ran “cp ~/.ssh/authorized_keys /root/.ssh” as root
  • Next I wanted to share a single Code folder between Windows and the Ubuntu VM:
    • 1st idea: create a symlink in the shared “vagrant” folder (C:\Users\Mike\VirtualBox VMs\Ubuntu-box-cutter) that would be addressable from inside the Ubuntu VM:
      [from Cygwin on host] ln -s /cygdrive/c/Users/Mike/Code Code
      [from Ubuntu VM] 
      vagrant@vagrant:~$ ls /vagrant/Code
      /vagrant/Code
      vagrant@vagrant:~$ cd /vagrant/Code
      bash: cd: /vagrant/Code: Not a directory
    • 2nd idea: try a native-Windows symlink, not a Cygwin version
      [from CMD on host] 
      C:\Users\Mike\VirtualBox VMs\Ubuntu-box-cutter>mklink /D Code C:\Users\Mike\Code
      symbolic link created for Code <<===>> C:\Users\Mike\Code
      [from Ubuntu VM]
      vagrant@vagrant:/vagrant$ cd /vagrant/Code
      bash: cd: /vagrant/Code: Not a directory
    • 3rd idea: create a new synced folder via vagrant
  • Meanwhile, I started out a new repo (https://github.com/MikeTheCanuck/jQuery-infra-update) to contain the suggested Ansible code (using the Git Bash shell from Git for Windows):
    • mkdir code
      cd code
      touch 50unattended-upgrades
      touch 02periodic
      git add 50unattended-upgrades
      git add 02periodic
    • [then tried git commit but got prompted for identity]
    • git config --global user.email "mikethecanuck@gmail.com"
      git config --global user.name "Mike Lonergan"
      git commit -m "First commit"
      git push origin master
    • [fun: Windows Credential manager prompted for my Github creds]
  • Now to follow the canonical git-flow http://nvie.com/posts/a-successful-git-branching-model/:
    • git checkout -b develop
      git checkout -b default-file-contents develop
      git commit -m "populated default content" code/02periodic code/50unattended-upgrades
      git push
    • [I didn’t think it worked – I didn’t see any changes on the upstream repo – so I tried over]
    • git reset
      git reset --hard
      git commit -m "Added default config file contents" -a
      git push
    • [which resulted in the warning “fatal: The current branch default-file-contents has no upstream branch.  To push the current branch and set the remote as upstream, use…(command)”, so I did]
    • git push --set-upstream origin default-file-contents
  • Result: I have two commits to the branch on my Git repo

Ansible testing environment setup

  • 1st: setup my inventory in /etc/ansible/hosts
  • 2nd: test connectivity “ansible dev -m ping -u root” [cause my target doesn’t have a vagrant account – yeah yeah, I’ll get to building a target using vagrant soon]
  • 3rd: move the files over for testing:
     ansible dev -u root -m copy -a "backup=yes src=~/code/jQuery-infra-update/code/50unattended-upgrades dest=/etc/apt/apt.conf.d/"
     ansible dev -u root -m copy -a "backup=yes src=~/code/jQuery-infra-update/code/02periodic dest=/etc/apt/apt.conf.d/"

Testing vagrant on Debian clones

  • I installed a base Debian box and ignored this error during “vagrant up” that I hoped wouldn’t be a problem:
    ==> default: Checking for guest additions in VM...
       default: No guest additions were detected on the base box for this VM! Guest
       default: additions are required for forwarded ports, shared folders, host only
       default: networking, and more. If SSH fails on this machine, please install
       default: the guest additions and repackage the box to continue.
       default:
       default: This is not an error message; everything may continue to work properly,
       default: in which case you may ignore this message.
    ==> default: Rsyncing folder: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ => /vagrant
    There was an error when attempting to rsync a synced folder
    Please inspect the error message below for more info.
    Host path: /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/
    Guest path: /vagrant
    Command: rsync --verbose --archive --delete -z --copy-links --chmod=ugo=rwX --no-perms --no-owner --no-group --rsync-path sudo rsync -e ssh -p 2222 -o ControlMaster=auto -o ControlPath=C:/cygwin64/tmp/ssh.275 -o ControlPersist=10m -o StrictHostKeyChecking=no -o IdentitiesOnly=true -o UserKnownHostsFile=/dev/null -i 'C:/Users/Mike/VirtualBox VMs/BaseDebian/.vagrant/machines/default/virtualbox/private_key' --exclude .vagrant/ /cygdrive/c/Users/Mike/VirtualBox VMs/BaseDebian/ vagrant@127.0.0.1:/vagrant
    Error: Warning: Permanently added '[127.0.0.1]:2222' (ECDSA) to the list of known hosts.
    mm_receive_fd: no message header
    process_mux_new_session: failed to receive fd 0 from slave
    mm_send_fd: sendmsg(2): Broken pipe
    mux_client_request_session: send fds failed
    rsync: connection unexpectedly closed (0 bytes received so far) [sender]
    rsync error: error in rsync protocol data stream (code 12) at io.c(226) [sender=3.1.2]
  • Now I’m working from the BaseDebian box I’d setup, and tried to clone a couple of boxes from it, but found that they complained that the VBox Additions weren’t installed and wouldn’t allow me to proceed:
    ==> default: Checking for guest additions in VM...
        default: No guest additions were detected on the base box for this VM! Guest
        default: additions are required for forwarded ports, shared folders, host only
        default: networking, and more. If SSH fails on this machine, please install
        default: the guest additions and repackage the box to continue.
        default:
        default: This is not an error message; everything may continue to work properly,
        default: in which case you may ignore this message.
    ==> default: Mounting shared folders...
        default: /vagrant => C:/Users/Mike/VirtualBox VMs/DebianTest1
    Failed to mount folders in Linux guest. This is usually because
    the "vboxsf" file system is not available. Please verify that
    the guest additions are properly installed in the guest and
    can work properly. The command attempted was:
    
    mount -t vboxsf -o uid=`id -u vagrant`,gid=`getent group vagrant | cut -d: -f3` vagrant /vagrant
    mount -t vboxsf -o uid=`id -u vagrant`,gid=`id -g vagrant` vagrant /vagrant
    
    The error output from the last command was:
    
    stdin: is not a tty
    mount: unknown filesystem type 'vboxsf'
  • OK, so I followed this article and have pulled together these commands to try to get the Additions installed on the base box (here, I destroyed the clones and am working to repair the box from which I cloned)
     sudo apt-get install gcc
     sudo apt-get install make

    (reboot the box)

     sudo mount /dev/cdrom /media/cdrom
     cd /media/cdrom
     sudo ./VBoxLinuxAdditions.run
  • Sadly, this procedure did *not* repair the rsync issue that I was seeing originally before packaging from that box, but I tried vagrant package and clone from this box again, in case what I did happens to have solve the “unknown filesystem type ‘vboxsf'” error
  • Nope, same problem on the cloned box – “unknown filesystem type ‘vboxsf'”
    • Just in case, I tried installing the additions in the cloned box
    • They won’t install – complains about the following:
      Building the VirtualBox Guest Additions kernel modules
      The headers for the current running kernel were not found. If the following
      module compilation fails then this could be the reason.
      
      Building the main Guest Additions module ...fail!
      (Look at /var/log/vboxadd-install.log to find out what went wrong)
      Doing non-kernel setup of the Guest Additions ...done.
    • And that log file says:
      /tmp/vbox.0/Makefile.include.header:97: *** Error: unable to find the sources of your current Linux kernel. Specify KERN_DIR=<directory> and run Make again.  Stop.
      Creating user for the Guest Additions.
      Creating udev rule for the Guest Additions kernel module.
    • So I tried this article next, but that didn’t entirely help – the headers still aren’t available
    • However, instead of erroring out, the VBox Guest Additions installation said this at the same place:
      Building the VirtualBox Guest Additions kernel modules
      The headers for the current running kernel were not found. If the following
      module compilation fails then this could be the reason.
      
      Building the main Guest Additions module ...done.
      Building the shared folder support module ...done.
      Building the graphics driver module ...done.
      Doing non-kernel setup of the Guest Additions ...done.
      Starting the VirtualBox Guest AdditionsInstalling the Window System drivers
      Could not find the X.Org or XFree86 Window System, skipping.
      ...done.
    • HOLY SHIT GUYS, it actually worked:
       default: Mounting shared folders...
         default: /vagrant => C:/Users/Mike/VirtualBox VMs/DebianTest1
    • So I rebuilt the base Debian box with this procedure – sadly it didn’t address the rsync issue (that was a valiant wish), but at least I should be able to clone this box and avoid the ‘vboxsf’ issue
    • To address the rsync issue, I tried a variant on this article’s top answer – placed the cwrsync folder on my C:\ root, then copied the script to an rsync.bat file

Issues I never solved

  • Q: what’s the source of the actual problem with the rsync on my BaseDebian box?
    • TODO: Follow this article for some ideas on troubleshooting/narrowing down the problem
  • Q: what’s the source of the problem that causes my “vagrant up” to fail twice on the SSH auth?

Other references

Bash/Ubuntu on Win10: getting *nix vagrant working with virtualbox (not)

TL;DR Getting vagrant + virtualbox running natively in Bash for Unbuntu on Windows is a no-go.  Try a hybrid Windows/WSL solution instead.

At the end of our last episode, our hero was trapped under the following paradox:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ vagrant up
VirtualBox is complaining that the installation is incomplete. Please
run `VBoxManage --version` to see the error message which should contain
instructions on how to fix this error.
mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ VBoxManage --version
WARNING: The character device /dev/vboxdrv does not exist.
         Please install the virtualbox-dkms package and the appropriate
         headers, most likely linux-headers-3.4.0+.

         You will not be able to start VMs until this problem is fixed.

However, the advice for installing virtualbox-dkms is merely a distraction:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ sudo apt-get install virtualbox-dkms
Reading package lists... Done
Building dependency tree
Reading state information... Done
virtualbox-dkms is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 44 not upgraded.

And installing linux-headers-3.4.0+ doesn’t seem to work:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ sudo apt-get install linux-headers-3.4.0+
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package linux-headers-3.4.0
E: Couldn't find any package by regex 'linux-headers-3.4.0'

Where to go from here?

AskUbuntu turns up this tasty lead:

http://askubuntu.com/questions/465454/problem-with-the-installation-of-virtualbox

…where VBoxManage instead indicates “…most likely linux-headers-generic”.  This aligns with my previous investigation into the version of Linux that ships with Bash on Ubuntu for Windows (‘uname -r’ returns “3.4.0+”, which I suspect is what VBoxManage appends to its “most likely” hint).

Aside

On a lark, I decided to see if I could confirm this theory from the virtualbox source code.  Since it’s Oracle, of course they had to use an “enterprise-y” repo (Trac) which provides a browseable but not searchable front-end, so I pawed through each of the .cpp files by hand on the off-chance this message was being constructed directly in VBoxManage*.cpp source:

https://www.virtualbox.org/browser/vbox/trunk/src/VBox/Frontends/VBoxManage

It’s entirely possible the message is passed up from an imported library, or that it’s constructed from fragments that don’t explicitly include the string “most likely” in any one line of source, but I wasn’t able to find it from this branch of the virtualbox source repo.

Dead End, Take a Guess

OK, if there’s no specific indication which version of the headers must be used, and on the assumption no damage can be caused by downloading what should merely be text files, then let’s just try the linux-headers-generic and see what happens.

And the apt-get messages seem promising – especially that it selected linux-headers-3.13.* files magically without me tracking down which specific versions I needed:

mike@MIKE-WIN10-SSD:/mnt/c/Users/Mike/VirtualBox VMs/BaseDebianServer$ sudo apt-get install linux-headers-generic
[sudo] password for mike:
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  linux-headers-3.13.0-92 linux-headers-3.13.0-92-generic
The following NEW packages will be installed:
  linux-headers-3.13.0-92 linux-headers-3.13.0-92-generic
  linux-headers-generic
0 upgraded, 3 newly installed, 0 to remove and 44 not upgraded.
Need to get 9,571 kB of archives.
After this operation, 77.0 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Cool, except for these lines in the script output:

Examining /etc/kernel/header_postinst.d.
run-parts: executing /etc/kernel/header_postinst.d/dkms 3.13.0-92-generic /boot/vmlinuz-3.13.0-92-generic

(why does that make me think I just overwrote something important?)

…and for reasons initially unknown to me, the run-parts script seems to go zombie.

[Aside: I’m still too polluted by my Win32 experience, so I kept trying to interrupt with Ctrl-C.  No.  Bad dog, no treat.  Instead, try Ctrl-Z (pronounced “zed”, ’cause I’m Canadian like that.]

Finding Out if Anyone Else Has Seen This

Vagrant is a pretty popular way of managing virtual machines these days, right?  Yeah.  And while I might be in the first days of the public release of Bash on Windows, there’s been an Insiders Preview going for months, and lots of people banging on the corners.

So what are the odds someone else has tried this too?

Old school: search stackoverflow.com, social.technet.microsoft.com.  No bueno – plenty of folks reporting issues on SO with Bash on Windows, but no one there has reported this vagrant problem.

New school: somehow stumbled across the github repo for BashOnWindows, and dutifully filled out as detailed an issue report as I could muster.

=== NOW HERE’S THE PART THAT BLEW MY MIND ===

A Microsoft employee responded with an intelligent and helpful reply within hours on the same day!!!

(I remember a decade ago, Microsoft’s ‘engagement’ with customers reporting real issues with new software – even when Microsoft’s external bug trackers existed – was abysmal.  You’d be lucky to get an acknowledgement inside a month, and rarely if ever would they bother to update the issue when/if it ever got dispositioned, let alone addressed.  THIS KIND OF RESPONSIVENESS IS AMAZING FROM A CORPORATION.)

Root Issue

My bad, I’d misunderstood the implications of this: WSL (Windows Subsystem for Linux), which supports the user-mode Bash on Ubuntu layer, doesn’t implement any native Linux kernel support.  It’s all user-mode support, and it’s only for non-GUI apps (i.e. things that don’t require Display:0).

Our intrepid Microsoft employee reports here that DKMS isn’t currently supported.  The fact I took it even further to try installing the linux headers was moot; /dev/vboxdrv wouldn’t be available no matter what.

Cleanup in Aisle 4

Did you happen to go down the same road as me?  [What, are you similarly touched in the head?]  If so, here’s what I did to back out of my mess:

  • Performed the lock/install package cleanup specified here
  • Did as clean an uninstall of the linux-headers-generic package as I could (running sudo apt-get –purge remove linux-headers-generic), which outputs
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following packages will be REMOVED:
      linux-headers-generic*
    0 upgraded, 0 newly installed, 1 to remove and 47 not upgraded.
    2 not fully installed or removed.
    After this operation, 29.7 kB disk space will be freed.
    Do you want to continue? [Y/n]

    …which leads to the same run-parts script that fails.  Cleanup the locks/install packages again…then pray not enough damage was done by run-parts (in either direction) to matter. [Boy is that a landmine waiting to go off months from now…]

  • Clean uninstall of vagrant (sudo apt-get –purge remove vagrant)…which somehow leads again to these same lines:
    ...
    Do you want to continue? [Y/n]
    (Reading database ... 64370 files and directories currently installed.)
    Removing vagrant (1.4.3-1) ...
    Purging configuration files for vagrant (1.4.3-1) ...
    Processing triggers for man-db (2.6.7.1-1ubuntu1) ...
    Setting up linux-headers-3.13.0-92-generic (3.13.0-92.139) ...
    Examining /etc/kernel/header_postinst.d.
    run-parts: executing /etc/kernel/header_postinst.d/dkms 3.13.0-92-generic /boot/vmlinuz-3.13.0-92-generic

    Ctrl-Z, rm locks and install bits.  [This is getting old.]

  • Clean uninstall of virtualbox (sudo apt-get –purge remove virtualbox)…and once again that unkillable linux-headers setup rears its head.
  • Let’s look closer.
  • Here’s the preamble when removing the vagrant package:
    mike@MIKE-WIN10-SSD:/var/lib/dpkg/updates$ sudo apt-get --purge remove vagrant
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    The following packages were automatically installed and are no longer required:
      bsdtar libarchive13 liblzo2-2 libnettle4 libruby1.9.1 ruby ruby-childprocess
      ruby-erubis ruby-ffi ruby-i18n ruby-log4r ruby-net-scp ruby-net-ssh
      ruby1.9.1
    Use 'apt-get autoremove' to remove them.
    The following packages will be REMOVED:
      vagrant*
    0 upgraded, 0 newly installed, 1 to remove and 47 not upgraded.
    1 not fully installed or removed.
    After this operation, 1,612 kB disk space will be freed.
    Do you want to continue? [Y/n]

    [My italics for emphasis]

  • Maybe – just MAYBE – the “not fully installed” package is linux-headers-generic, and if I could coax apt-get or dpkg to clean *that* up, we’d rid ourselves of this mess.  [*foreshadowing*  …or maybe I just need to find out how to wipe and reinstantiate Bash on Windows…]
  • First, do the suggested cleanup (sudo apt-get autoremove)
  • Then install debfoster and deborphan
  • Debfoster reports nothing interesting, but deborphan reports:
    deborphan: The status file is in an improper state.
    One or more packages are marked as half-installed, half-configured,
    unpacked, triggers-awaited or triggers-pending. Exiting.
  • This article provides a great grep for isolating the issue – here’s what it uncovered:
    Package: linux-headers-3.13.0-92-generic
    Status: install ok half-configured
    --
    Package: dialog
    Status: install ok unpacked
    --
    Package: debfoster
    Status: install ok unpacked
    --
    Package: deborphan
    Status: install ok unpacked
  • sudo dpkg –audit reports:
    The following packages are only half configured, probably due to problems
    configuring them the first time.  The configuration should be retried using
    dpkg --configure <package> or the configure menu option in dselect:
      linux-headers-3.13.0-92-generic Linux kernel headers for version 3.13.0 on 64
  • We already know “retry” isn’t the answer here…
  • sudo dpkg –configure –pending definitely kicks off the dead-end configuration of the headers…what can cause this to back out, or to remove the stuff that keeps getting triggered?
  • As I was about to uninstall Bash for Ubuntu, I (for no reason) ran exit from within the Bash shell, which showed me this new output:
    mike@MIKE-WIN10-SSD:/var/lib/dpkg$ exit
    exit
    run-parts: waitpid: Interrupted system call
    Failed to process /etc/kernel/header_postinst.d at /var/lib/dpkg/info/linux-headers-3.13.0-92-generic.postinst line 110.
    dpkg: error processing package linux-headers-3.13.0-92-generic (--configure):
    subprocess installed post-installation script returned error exit status 4
    Setting up dialog (1.2-20130928-1) …
  • After a few minutes I just killed the window
  • Restarted Bash, didn’t appear to have made any improvements on my situation.

Perhaps it’s time to finally throw in the towel.

Complete rebuild of Bash on Ubuntu

When all else fails, uninstall and reinstall.  Thankfully I hadn’t invested a ton of real work into this…

According to this comment, the following command cleans up the whole deal:

Lxrun /uninstall /full

(Coda: In case you get an error re-installing afterwards, try running this command again.  I happened to end up with error code 0x80070091 for which I could find no help, but others have reported other error codes too.)

Let’s try this again from scratch.

Hope: I discovered the cbwin project is being actively developed, to enable users of Bash on Ubuntu for Win10 to launch Windows binaries from within the bash environment.  I’ll try this for the vagrant/virtualbox combo and report back.

Update

I quickly ran into limits with cbwin in this particular setup, but seemed to have found peace with a hybrid approach.

Found: Minimum permissions to edit Site Columns in SharePoint 2010

Our organization has a requirement to delegate the maintenance of a number of Site Columns (re-used throughout our business application) to colleagues who do not have Site-wide responsibilities or authority.  We didn’t want to simply grant one of the all-powerful permissions/groups (such as Site Collection Administrators, Site Collection Owner, [Site] Owners, or just granting Full Control) as that would leave them open to accusatory questions every time the site or application started behaving badly.  We’d like them to be able to act fairly autonomously, without having to worry that they could cause unintentional damage to the site or application due to excessive permissions.

Knowing that most every Microsoft product has a rich, granular set of DACLs applied at every object level, and having exercised some of the rich permissions available in SharePoint 2010, I was pretty sure there should be a way to combine some permissions together to enable Site Column edits, without giving away the farm.  However I was unable to find any documentation or research that definitively asserted they knew what those permissions were.

Finally I asked my colleague Dale Cox to pair up with me and, together with four hours and an experimental sub-site, we were able to work out a tight sub-set of permissions that (a) definitely allow Site Columns edits to propagate to inheriting Lists and (b) were ratcheted down stepwise until we were comfortable there was no place lower to go.

Permissions: Summary

There are two sets of permissions we tested for: site-level permissions needed to edit an existing Site Column (that is being used as a column in one or more Lists in the SharePoint site), and List-level permissions needed to propagate Site Column changes down to inheriting columns in a List.

The former permissions *only* exist at the site level (even though some permissions in the permissions set are labelled “site permissions” and others “list permissions”) – we’ve found there are *no* permissions that can be managed directly related to Site Columns, but only indirectly via the site-level permissions.

That we also require the latter permissions was surprising at first (once we configured the inheritance, why should we need ongoing permissions on every inheriting List?), but that’s just a design decision that Microsoft’s SharePoint team made.  It means we have to manually *track* all the Lists that inherit changes from Site Columns, and make sure our site column maintenance people have enough permissions to the Lists.

Site-level Permissions

  • Manage Lists (labelled “List Permissions”)
  • View Items (labelled “List Permissions”)
  • Add and Customize Pages (labelled “Site Permissions”)
  • Browse Directories (labelled “Site Permissions”)
  • View Pages (labelled “Site Permissions”)
  • Open (labelled “Site Permissions”)

List-level Permissions

  • Manage Lists (individual permission – which includes View Items, View Pages and Open)
  • Contribute (permission set)

There is one case we did *not* explicitly test for, but found was part of the resultant permissions we arrived at: creating a new Site Column.  While we believe this probably requires only a subset of the site-level (site and/or List) permissions, we didn’t spend the extra cycles to isolate the minimum permissions needed there.  Perhaps it’s implied in the results, and if we need to know we’ll pursue it, but it’s not something we needed to know right now.

Steps We Followed To Distill These Permissions

We had to work as a team to test each permissions combination, as I’d found early on that as soon as I removed myself from the all-powerful groups, I was unable to restore myself or change permissions later on.  I manipulated the permissions, and Dale ran the test cases.  I pre-created the following groups in our experimental site and assigned them specific site-level permissions, to make it easy to switch from one permission set to another:

Group name

Permission Levels assigned to group (custom permissions included)

Approve Approve
Design Design
Manage lists Manage Lists (Manage Lists, View Items, View Pages, Open, Manage Personal Views)
Manage web site Manage Web Site (View Items, Manage Web Site, Add and Customize Pages, Browse Directories, View Pages, Enumerate Permissions, Browse User Information, Open)
Site viewers Contribute, Read, View Only
Edit Site Columns (Add and Customize Pages, Browse Directories, View Pages, Open)

 

  1. Dale’s starting position:
    • Permissions: member of Site Viewers group
    • Result: can view site Libraries and Lists but cannot even load the Site Columns page (/_layouts/mngfield.aspx)
  2. Added Dale to “Manage Lists” group
    • Result: can load Site Settings page but still cannot access Site Columns page
  3. Removed Dale from “Manage Lists” group, added to “Manage web site” group
    • Result: Can load Site Columns page but cannot edit site columns
  4. Granted Dale the Contribute permission on one of the Lists that inherits a Column definition from the Site Column being tested.
    • Result: can still load Site Columns page but still cannot edit the site column being tested
    • Dale even tried just updating the site column definition without propagating the change to the inheriting Lists, but that still didn’t work
    • NOTE: from here onwards we started making a point of trying to separate out the ability to edit the site column definition from the ability for that site column definition to propagate to inheriting lists
  5. Removed Dale from “Manage web site” group, added to “Approve” group
    • Result: Dale cannot load Site Columns page
  6. Added “Manage Web Site” permission to a new permission group called “Approve + MWS”
    • That is, copied the “Approve” permission group, then checked the “Manage Web Site” permission (and left the dependent permissions that came with it)
    • Result: Dale can load the Site Columns page, but cannot commit changes to any Site Columns
  7. Removed Dale from “Approve + MWS”, added Dale to “Manage Hierarchy” group
    • Note: while the difference in included permissions between “Manage Web Site” and “Manage Hierarchy” is large, still the “Manage Hierarchy” permission set is a superset of “Manage Web Site”
    • Result: Dale can access the Site Columns page, but cannot commit changes to Site Columns
    • However, we noticed that Dale can create a new Site Column, and can edit that new Site Column
    • Dale attempted just a non-propagating edit with one of the existing Site Columns, and was successful.
    • Theory: current permissions on the Lists that inherit from the existing Site Columns are what’s blocking the previous Site Column edits
  8. Granted to Dale “Contribute” permission on all of the three Lists which inherit from the existing Site Columns
    • Result: Dale cannot commit a propagating edit on an existing, inherited Site Column
    • Theory: the individual permission “Manage Lists” is what’s needed on each inheriting List for the Site Column propagation to succeed
  9. Granted to Dale “Design” permission (which includes the individual “Manage Lists” permission) to one of the three inheriting Lists
    • Result: Dale could not commit a propagating edit on an existing, inherited Site Column.  Further, the change did not even inherit to the inheriting List (where the permissions needed for propagation could have allowed the change).
    • Theory: propagating edits to a Site Column are a transactional request – if the request doesn’t succeed on every inheriting List, then the edit “transaction” is rolled back, rather than partially succeeding.
  10. Changed the following permissions to all three inheriting Lists: changed from “Design” + “Contribute” to “Manage Lists” + “Contribute”
    • Result: Dale’s propagating edit finally succeeded!
  11. Decided to go back over the site-level permissions to determine whether there was an unknown interaction between the site-level and list-level permissions that we didn’t account for.
  12. Kept the List-level permissions as-is, but changed Dale’s site-level permissions – removed him from “Manage Hierarchy” and added him to “Approve + MWS” group.
    • Result: propagating edit failed, and non-propagating edit failed.
  13. Removed Dale from “Approve + MWS” group, added Dale to Design group (which has Design permissions set).
    • Result: propagating edit succeeded.
  14. Compared the permissions between the “Design” and the “Manage Hierarchy” permissions groups, and found the following eight permissions in common:
    • Add and Customize Pages
    • Browse Directories
    • View Pages
    • Browse User Information
    • Use Remote Interfaces
    • Use Client Integration Features
    • Open
    • Edit Personal User Information
  15. Analysis of these permissions, looking for the “secret sauce” that they share:
    • The “Manage Web Site” site-level individual permission isn’t the one that Dale needs (as it isn’t part of the Design permission set, and yet Dale was still able to propagate the edit when he had the Design permission)
    • Based on the descriptions of each individual permission listed above, the most likely candidate is “Add and Customize Pages”
  16. Created a new Group called “Edit Site Columns”.  Created a new permissions set called “Edit Site Columns”.  To that permissions set, we added only “Add and Customize Pages”.  Removed Dale from Design group, added him to Edit Site Columns group.
    • Note: when “Add and Customize Pages” is checked, the following permissions were automatically checked: “Browse Directories”, “View Pages”, “Open”, “View Items”.
    • Result: propagating and non-propagating edits failed.  Adding a new Site Column also failed.  Dale could access the Site Columns page.
  17. Added back all eight site-level permissions that were in common between the Design and Manage Hierarchy permissions sets.
    • Note: the “Edit Site Columns” permission set at this point doesn’t include any of the List-level permissions that I’ve so far ignored.  Continued failure means that to edit Site Columns, a user needs some of these “list permissions” assigned at the site level.  [Confused yet?]
    • Result: propagating and non-propagating edits failed.  Dale is still able to access the Site Columns page.
  18. Re-examined the “list permissions” that were included in both of the site-level permissions sets (Design and Manage Hierarchy) that were successful for Dale:
    • Manage Lists  –  Create and delete lists, add or remove columns in a list, and add or remove public views of a list. 
    • Override Check Out  –  Discard or check in a document which is checked out to another user. 
    • Add Items  –  Add items to lists and add documents to document libraries. 
    • Edit Items  –  Edit items in lists, edit documents in document libraries, and customize Web Part Pages in document libraries. 
    • Delete Items  –  Delete items from a list and documents from a document library. 
    • View Items  –  View items in lists and documents in document libraries. 
    • Open Items  –  View the source of documents with server-side file handlers. 
    • View Versions  –  View past versions of a list item or document. 
    • Delete Versions  –  Delete past versions of a list item or document. 
    • Create Alerts  –  Create alerts. 
    • View Application Pages  –  View forms, views, and application pages. Enumerate lists. 
  19. Added all of the above permissions to the “Edit Site Columns” permission set except View Items.
    • Result: propagating edits were successful.
  20. Stripped down the site permissions from the “Edit Site Columns” permission set to leave only the following: ACP, Browse Directories, View Pages, Open.
    • Result: propagating edits were successful.
  21. Stripped out a number of List-level permissions from the “Edit Site Columns” permission set to leave only the following: ML, VI.  Left the site permissions as in the last step.
    • Result: propagating edits were successful.
  22. Wanted to ensure this result is reproducible, so we created a new permissions set “Edit Site Columns – Confirmation”, added the same permissions to this new set (i.e. ML, VI, ACP, BD, VP, Open),  created new group “Edit Site Columns – Confirmation”, granted the new permission set to the new group, added Dale to the new group and removed him from the older “Edit Site Columns” group.
    • Result: all operations were successful – view Site Columns page, commit propagating and non-propagating edit of inherited Site Columns, add new Site Column.

Lenovo T61? High CPU usage? Look for bad drivers (…spewing hardware interrupts)

For a couple of months, I noticed some pretty unusual CPU utilization on my Lenovo T61 Thinkpad (running Windows XP SP3).  I run Process Explorer (ProcExp) in the background all day, to keep an eye out for unusual system performance behaviour, and lately I’ve noticed that almost an entire “core” on my Intel dual-core system is consumed by something running in Kernel space.

notificationAreaI have ProcExp configured to display the CPU meter in the “systray” (aka notification area), and it’s really easy to know when there’s a kernel-consumer on your system – that’s when the CPU utilization graph will show a sizeable proportion of the line graph in red, rather than its more typical green (for User space consumers).

procexp Pop open ProcExp under this condition, and if you’ve spent as much time watching these behaviour profiles as I have, and you’ll be as surprised as I was to find the “Interrupts” process (really, just an abstract aggregate) reporting CPU utilization in the 30’s or 40’s.

 

Summary of the solution: bad drivers, needed to be updated.  In my case it was a network driver that had just been updated by the vendor; for others the culprits seem to be video drivers.  I’m sure other failures exist, but these are the two most common.

Step 1: Device Manager

When hardware and/or drivers are acting up like this, check to make sure your drivers are loading properly in Windows:

  • Open the Device Manager control panel applet (e.g. run “devmgmt.msc”)
  • Look at the list to see if any categories are expanded
  • In every expanded category, examine the details of any device that has a yellow exclamation mark (indicating there is some problem with the driver).  If at all possible, resolve the issue that’s been identified.

Until all driver issues have been resolved, any other suggestions you follow below could be moot.  If you really can’t fix the problem(s) that Device Manager is presenting, but want to try other approaches, then there are a couple of tactical workarounds you may wish to try:

  1. Install a newer (or older) version of the driver, to see if it’s just the current driver version that’s having problems.
  2. Disable the device temporarily, and make sure you are still seeing the excessive hardware interrupts being generated for a sustained period.  (If not, then perhaps that was the offending driver, and perhaps you’ll be able to run with the device disabled while the vendor addresses the issue in a later version of the driver.  Lucky you.)

Step 2: KernRate

Microsoft has an interesting little package, developed for use on Windows XP, to dig into a little detail on the Kernel-related activity on the system.  It’s not exactly a wizard I’d send to my mom, but it’s helped at least narrow the potential issues to track down for me.

You can download KernRate from here.

2.1 Setup & Config for a valid KernRate result

  • Install Windows Debugging Tools
  • Install KernRate
  • Configure valid symbols PATH variable – see this thread for details
    • Create a new directory C:\Symbols
    • Create a new environment variable by clicking on: Start menu > Control Panel > System > Advanced > Environment Variables > click New under “user variables for …”, then type:
      • Variable name: _NT_SYMBOL_PATH
      • Variable value:
  • Copy DBGHELP.DLL and SYMSRV.DLL from the install location for Windows Debugging Tools (usually C:\Program Files\Debugging Tools for Windows (x86)) to the install location for KernRate (usually C:\Program Files\KrView\Kernrates)

2.2 Use KernRate to profile kernel activity

Once it’s installed and configured correctly, you can just fire up Kernrate_i386_xp.exe, let it run for a few minutes (during the period when the Kernel activity is not what you’d expect), and then hit [Ctrl]-C to get the results of the Kernrate run.

Here’s a typical output from my system during normal operating conditions  (NOTE: for every Kernrate output, I’ve left out the entries from the final table that have no significant “hits” i.e. no significant volume of kernel events):

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe
/==============================\

\==============================/
Date: 2009/06/09   Time: 13:50:38
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:11.296 (17.7%)  U 0:00:06.171 ( 9.7%)  I 0:00:46.328 (72.6%)  DPC
0:00:00.078 ( 0.1%)  Interrupt 0:00:00.031 ( 0.0%)
       Interrupts= 76929, Interrupt Rate= 1206/sec.

P1     K 0:00:12.328 (19.3%)  U 0:00:07.718 (12.1%)  I 0:00:43.750 (68.6%)  DPC
0:00:00.109 ( 0.2%)  Interrupt 0:00:00.531 ( 0.8%)
       Interrupts= 15687, Interrupt Rate= 246/sec.

TOTAL  K 0:00:23.625 (18.5%)  U 0:00:13.890 (10.9%)  I 0:01:30.078 (70.6%)  DPC
0:00:00.187 ( 0.1%)  Interrupt 0:00:00.562 ( 0.4%)
       Total Interrupts= 92616, Total Interrupt Rate= 1452/sec.

Total Profile Time = 63796 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       184655872,       180600832,        -4055
040
    Available Pagefile(s)       ,      1075990528,      1069088768,        -6901
760
    Available Virtual           ,      2132107264,      2131058688,        -1048
576
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,       949736,         14887/sec.
    System Calls         ,      2754324,         43173/sec.
    Page Faults          ,       799583,         12533/sec.
    I/O Read Operations  ,         5382,         84/sec.
    I/O Write Operations ,         4657,         73/sec.
    I/O Other Operations ,        92064,         1443/sec.
    I/O Read Bytes       ,     15853022,         2946/ I/O
    I/O Write Bytes      ,      1968451,         423/ I/O
    I/O Other Bytes      ,      9179346,         100/ I/O

—————————–

Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 229
Percentage in the following table is based on the Total Hits for the Kernel

Time   11604 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               9041      63796    77 %     3542933
hal                                    1155      63796     9 %      452614
win32k                                  604      63796     5 %      236691
mfehidk                                 179      63796     1 %       70145
intelppm                                163      63796     1 %       63875
igxpmp32                                112      63796     0 %       43889
HIPK                                     98      63796     0 %       38403 
……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

And here’s the output from my system during one of these CPU over-consumption cases:

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe
/==============================\

\==============================/
Date: 2009/06/10   Time:  8:41:26
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
Starting to collect profile data

***> Press ctrl-c to finish collecting profile data
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:02:26.906 (80.8%)  U 0:00:21.640 (11.9%)  I 0:00:13.328 ( 7.3%)  DPC
0:00:01.078 ( 0.6%)  Interrupt 0:01:18.406 (43.1%)
       Interrupts= 3919032, Interrupt Rate= 21548/sec.

P1     K 0:02:14.000 (73.7%)  U 0:00:32.703 (18.0%)  I 0:00:15.171 ( 8.3%)  DPC
0:00:00.640 ( 0.4%)  Interrupt 0:01:02.890 (34.6%)
       Interrupts= 3837550, Interrupt Rate= 21100/sec.

TOTAL  K 0:04:40.906 (77.2%)  U 0:00:54.343 (14.9%)  I 0:00:28.500 ( 7.8%)  DPC
0:00:01.718 ( 0.5%)  Interrupt 0:02:21.296 (38.8%)
       Total Interrupts= 7756582, Total Interrupt Rate= 42648/sec.

Total Profile Time = 181875 msec

                                       BytesStart          BytesStop         By
esDiff.
    Available Physical Memory   ,       564551680,       571240448,         668
768
    Available Pagefile(s)       ,      2194202624,      2233167872,        3896
248
    Available Virtual           ,      2132402176,      2131353600,        -104
576
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,      1764882,         9704/sec.
    System Calls         ,     14821228,         81491/sec.
    Page Faults          ,       188084,         1034/sec.
    I/O Read Operations  ,        31768,         175/sec.
    I/O Write Operations ,        27985,         154/sec.
    I/O Other Operations ,       321733,         1769/sec.
    I/O Read Bytes       ,     24755975,         779/ I/O
    I/O Write Bytes      ,     14122756,         505/ I/O
    I/O Other Bytes      ,     31150904,         97/ I/O

—————————–
Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   62240 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                              40159     181875    64 %     5520137
hal                                   10990     181875    17 %     1510652
igxpmp32                               2381     181875     3 %      327285
HECI                                   2170     181875     3 %      298281
win32k                                 1495     181875     2 %      205498
intelppm                               1335     181875     2 %      183505
mfehidk                                 931     181875     1 %      127972
HIPK                                    660     181875     1 %       90721
Ntfs                                    361     181875     0 %       49621
usbuhci                                 285     181875     0 %       39175
PGPwded                                 223     181875     0 %       30652

……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================
 

 

2.3 KernRate detailed profiling on individual modules

So what I’m seeing is (like many other folks on the ‘Net) that the NTKRNLPA module is consuming the lions’ share of the CPU (being consumed by Kernel-space threads).  Following some of the threads on the Sysinternals message boards, I dug into the details of what images were being loaded by NTKRNLPA during these kernel events:

2.3.1 NTKRNLPA

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -s 5 -z ntkrnlpa
/==============================\

\==============================/
Date: 2009/06/10   Time:  8:49:42
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -s 5 -z ntkrnlpa

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
CallBack: Finished Attempt to Load symbols for 804d7000 \WINDOWS\system32\ntkrnlpa.exe

Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:01.984 (39.7%)  U 0:00:00.406 ( 8.1%)  I 0:00:02.609 (52.2%)  DPC
0:00:00.031 ( 0.6%)  Interrupt 0:00:01.421 (28.4%)
       Interrupts= 97650, Interrupt Rate= 19530/sec.

P1     K 0:00:02.125 (42.5%)  U 0:00:00.453 ( 9.1%)  I 0:00:02.421 (48.4%)  DPC
0:00:00.062 ( 1.3%)  Interrupt 0:00:01.406 (28.1%)
       Interrupts= 95410, Interrupt Rate= 19082/sec.

TOTAL  K 0:00:04.109 (41.1%)  U 0:00:00.859 ( 8.6%)  I 0:00:05.031 (50.3%)  DPC
0:00:00.093 ( 0.9%)  Interrupt 0:00:02.828 (28.3%)
       Total Interrupts= 193060, Total Interrupt Rate= 38612/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       563130368,       560410624,        -2719
744
    Available Pagefile(s)       ,      2230300672,      2226749440,        -3551
232
    Available Virtual           ,      2132402176,      2104180736,       -28221
440
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        73005,         14601/sec.
    System Calls         ,       214633,         42927/sec.
    Page Faults          ,         3814,         763/sec.
    I/O Read Operations  ,         1175,         235/sec.
    I/O Write Operations ,          433,         87/sec.
    I/O Other Operations ,         4569,         914/sec.
    I/O Read Bytes       ,       776604,         661/ I/O
    I/O Write Bytes      ,       129278,         299/ I/O
    I/O Other Bytes      ,       575190,         126/ I/O

—————————–
Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1762 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1307       4984    74 %     6555979
hal                                     246       4984    13 %     1233948
HECI                                     44       4984     2 %      220706
igxpmp32                                 38       4984     2 %      190609
intelppm                                 36       4984     2 %      180577
win32k                                   33       5000     1 %      165000
mfehidk                                  15       5000     0 %       75000
HIPK                                     12       5000     0 %       60000
……

===> Processing Zoomed Module ntkrnlpa.exe…

—– Zoomed module ntkrnlpa.exe (Bucket size = 16 bytes, Rounding Down) ——-

Percentage in the following table is based on the Total Hits for this Zoom Module

Time   1307 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
KiDispatchInterrupt                     787       4984    46 %     3947632
READ_REGISTER_USHORT                    383       4984    22 %     1921147
READ_REGISTER_ULONG                     342       4984    20 %     1715489
READ_REGISTER_UCHAR                      41       4984     2 %      205658
KeReleaseInStackQueuedSpinLockFromDpcLevel         13       4984     0 %       65208
ZwYieldExecution                         10       4984     0 %       50160

……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================
 

2.3.2 HAL

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -v -s 5 -z hal
/==============================\

\==============================/
Date: 2009/06/10   Time: 18:25:00
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -v -s 5 -z hal

KERNRATE: current IMAGEHLP SymOptions: UNDNAME DEBUG

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
KERNRATE: IMAGEHLP symbol search path is: .;symsrv*symsrv.dll*c:\symbols*
http://
msdl.microsoft.com/download/symbols;C:\WINDOWS\System32\Drivers;C:\WINDOWS\System32;C:\WINDOWS
CallBack: Loading symbols for hal.dll…
DBGHELP: No header for hal.dll.  Searching for image on disk
DBGHELP: C:\Program Files\KrView\Kernrates\hal.dll – file not found
DBGHELP: hal.dll not found in .
DBGHELP: hal.dll not found in C:\WINDOWS\System32\Drivers
DBGHELP: C:\WINDOWS\System32\hal.dll – OK
DBGHELP: C:\WINDOWS\System32\hal.dll found
DBGHELP: .\halmacpi.pdb – file not found
DBGHELP: .\dll\halmacpi.pdb – file not found
DBGHELP: .\symbols\dll\halmacpi.pdb – file not found
CallBack: Finished Attempt to Load symbols for 806e4000 \WINDOWS\system32\hal.dll

DBGHELP: hal – public symbols
         c:\symbols\halmacpi.pdb\9875FD697ECA4BBB8A475825F6BF885E1\halmacpi.pdb
Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:02.531 (50.6%)  U 0:00:00.453 ( 9.1%)  I 0:00:02.015 (40.3%)  DPC
0:00:00.000 ( 0.0%)  Interrupt 0:00:01.890 (37.8%)
       Interrupts= 119072, Interrupt Rate= 23814/sec.

P1     K 0:00:02.609 (52.2%)  U 0:00:00.656 (13.1%)  I 0:00:01.734 (34.7%)  DPC
0:00:00.031 ( 0.6%)  Interrupt 0:00:01.437 (28.7%)
       Interrupts= 114272, Interrupt Rate= 22854/sec.

TOTAL  K 0:00:05.140 (51.4%)  U 0:00:01.109 (11.1%)  I 0:00:03.750 (37.5%)  DPC
0:00:00.031 ( 0.3%)  Interrupt 0:00:03.328 (33.3%)
       Total Interrupts= 233344, Total Interrupt Rate= 46669/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       111005696,       117772288,         6766
592
    Available Pagefile(s)       ,      1635250176,      1639911424,         4661
248
    Available Virtual           ,      2132402176,      2131013632,        -1388
544
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        82262,         16452/sec.
    System Calls         ,       292287,         58457/sec.
    Page Faults          ,         6431,         1286/sec.
    I/O Read Operations  ,         1592,         318/sec.
    I/O Write Operations ,          506,         101/sec.
    I/O Other Operations ,         7687,         1537/sec.
    I/O Read Bytes       ,       344208,         216/ I/O
    I/O Write Bytes      ,       457023,         903/ I/O
    I/O Other Bytes      ,       951393,         124/ I/O

—————————–

Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1761 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1223       5000    69 %     6115000
hal                                     258       5000    14 %     1290000
intelppm                                 74       5000     4 %      370000
HECI                                     55       5000     3 %      275000
win32k                                   32       5000     1 %      160000
igxpmp32                                 29       5000     1 %      145000
HIPK                                     22       5000     1 %      110000
mfehidk                                  15       5000     0 %       75000
……

===> Processing Zoomed Module hal.dll…

—– Zoomed module hal.dll (Bucket size = 16 bytes, Rounding Down) ——–
Percentage in the following table is based on the Total Hits for this Zoom Module

Time   258 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
READ_PORT_UCHAR                         143       5000    25 %      715000
READ_PORT_ULONG                         139       5000    24 %      695000
READ_PORT_USHORT                        139       5000    24 %      695000
HalpQueryPerformanceCounter              24       5000     4 %      120000
KeReleaseQueuedSpinLock                  21       5000     3 %      105000
KeAcquireInStackQueuedSpinLock           21       5000     3 %      105000
KfLowerIrql                               8       5000     1 %       40000
ExAcquireFastMutex                        7       5000     1 %       35000
KeGetCurrentIrql                          7       5000     1 %       35000
HalpGetCmosCenturyByte                    7       5000     1 %       35000
HalpReleaseSystemHardwareSpinLock          7       5000     1 %       35000
HalBeginSystemInterrupt                   6       5000     1 %       30000
KeAcquireQueuedSpinLock                   5       5000     0 %       25000
……

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

2.3.3 IGXPMP32

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -v -s 5 -z igxpmp32
/==============================\

\==============================/
Date: 2009/06/10   Time: 18:25:11
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -v -s 5 -z igxpmp32

KERNRATE: current IMAGEHLP SymOptions: UNDNAME DEBUG

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
KERNRATE: IMAGEHLP symbol search path is: .;symsrv*symsrv.dll*c:\symbols*
<a href="http://
“>“>”>”>”>”>”>http://
msdl.microsoft.com/download/symbols;C:\WINDOWS\System32\Drivers;C:\WINDOWS\System32;C:\WINDOWS
CallBack: Loading symbols for igxpmp32.sys…
DBGHELP: No header for igxpmp32.sys.  Searching for image on disk
DBGHELP: C:\Program Files\KrView\Kernrates\igxpmp32.sys – file not found
DBGHELP: igxpmp32.sys not found in .
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.sys – OK
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.sys found
DBGHELP: .\igxpmp32.pdb – file not found
DBGHELP: .\sys\igxpmp32.pdb – file not found
DBGHELP: .\symbols\sys\igxpmp32.pdb – file not found
SYMSRV:  The server name or address could not be resolved
SYMSRV:  The server name or address could not be resolved
SYMSRV:  c:\symbols\igxpmp32.pdb\01C98FF4DD054B6AB11509CF9CF04F171\igxpmp32.pdb
not found
SYMSRV: 
http://msdl.microsoft.com/download/symbols/igxpmp32.pdb/01C98FF4DD054B6
AB11509CF9CF04F171/igxpmp32.pdb not found
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\symbols\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\symbols\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\symbols\sys\igxpmp32.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\igxpmp32.pdb – file not found
DBGHELP: D:\ccViews\autobuild1_BR-0810-1787_14.36_Snapshot\gfx_Development\dump3
2\igfx\xp\gtune\AIM3Lib\igxpmp32.pdb – file not found
CallBack: Finished Attempt to Load symbols for b942f000 \SystemRoot\system32\DRI
VERS\igxpmp32.sys

DBGHELP: igxpmp32 – export symbols
Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:02.812 (56.1%)  U 0:00:00.265 ( 5.3%)  I 0:00:01.937 (38.6%)  DPC
0:00:00.000 ( 0.0%)  Interrupt 0:00:01.812 (36.1%)
       Interrupts= 114753, Interrupt Rate= 22879/sec.

P1     K 0:00:02.937 (58.8%)  U 0:00:00.562 (11.3%)  I 0:00:01.500 (30.0%)  DPC
0:00:00.062 ( 1.3%)  Interrupt 0:00:01.578 (31.6%)
       Interrupts= 109952, Interrupt Rate= 21990/sec.

TOTAL  K 0:00:05.750 (57.4%)  U 0:00:00.828 ( 8.3%)  I 0:00:03.437 (34.3%)  DPC
0:00:00.062 ( 0.6%)  Interrupt 0:00:03.390 (33.9%)
       Total Interrupts= 224705, Total Interrupt Rate= 44941/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       119107584,       107962368,       -11145
216
    Available Pagefile(s)       ,      1640800256,      1629646848,       -11153
408
    Available Virtual           ,      2132402176,      2102800384,       -29601
792
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        78959,         15792/sec.
    System Calls         ,       301466,         60293/sec.
    Page Faults          ,         8704,         1741/sec.
    I/O Read Operations  ,          711,         142/sec.
    I/O Write Operations ,          197,         39/sec.
    I/O Other Operations ,         5430,         1086/sec.
    I/O Read Bytes       ,      1212541,         1705/ I/O
    I/O Write Bytes      ,        14701,         75/ I/O
    I/O Other Bytes      ,      1486190,         274/ I/O

—————————–

Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1764 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1232       5000    69 %     6160000
hal                                     235       5000    13 %     1175000
intelppm                                 93       5015     5 %      463609
HECI                                     42       5015     2 %      209371
win32k                                   34       5000     1 %      170000
igxpmp32                                 32       5015     1 %      159521
mfehidk                                  23       5000     1 %      115000
HIPK                                     21       5000     1 %      105000
kmixer                                   11       5000     0 %       55000
KSecDD                                   10       5015     0 %       49850
……

===> Processing Zoomed Module igxpmp32.sys…

—– Zoomed module igxpmp32.sys (Bucket size = 16 bytes, Rounding Down) ——-

Percentage in the following table is based on the Total Hits for this Zoom Module

Time   32 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
xc_copp_agent                            30       5015    100 %      149551

================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

2.3.4 HECI

C:\Program Files\KrView\Kernrates>Kernrate_i386_XP.exe -v -s 5 -z heci
/==============================\

\==============================/
Date: 2009/06/10   Time: 18:48:17
Machine Name: MIKESL-MOBL1
Number of Processors: 2
PROCESSOR_ARCHITECTURE: x86
PROCESSOR_LEVEL: 6
PROCESSOR_REVISION: 0f0b
Physical Memory: 2007 MB
Pagefile Total: 3856 MB
Virtual Total: 2047 MB
PageFile1: \??\C:\pagefile.sys, 2006MB
OS Version: 5.1 Build 2600 Service-Pack: 3.0
WinDir: C:\WINDOWS

Kernrate User-Specified Command Line:
Kernrate_i386_XP.exe -v -s 5 -z heci

KERNRATE: current IMAGEHLP SymOptions: UNDNAME DEBUG

Kernel Profile (PID = 0): Source= Time,
Using Kernrate Default Rate of 25000 events/hit
KERNRATE: IMAGEHLP symbol search path is: .;symsrv*symsrv.dll*c:\symbols*
<a href="http://
“>“>”>”>”>”>”>http://
msdl.microsoft.com/download/symbols;C:\WINDOWS\System32\Drivers;C:\WINDOWS\System32;C:\WINDOWS
CallBack: Loading symbols for HECI.sys…
DBGHELP: No header for HECI.sys.  Searching for image on disk
DBGHELP: C:\Program Files\KrView\Kernrates\HECI.sys – file not found
DBGHELP: HECI.sys not found in .
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.sys – OK
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.sys found
DBGHELP: .\HECI.pdb – file not found
DBGHELP: .\sys\HECI.pdb – file not found
DBGHELP: .\symbols\sys\HECI.pdb – file not found
SYMSRV:  The server name or address could not be resolved
SYMSRV:  The server name or address could not be resolved
SYMSRV:  c:\symbols\HECI.pdb\0A57FC20FA904FB8BAEC77C53EE39C821\HECI.pdb not found
SYMSRV: 
http://msdl.microsoft.com/download/symbols/HECI.pdb/0A57FC20FA904FB8BAE
C77C53EE39C821/HECI.pdb not found
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\symbols\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\symbols\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\symbols\sys\HECI.pdb – file not found
DBGHELP: C:\WINDOWS\System32\Drivers\HECI.pdb – file not found
DBGHELP: d:\ccviews\autobuild1_br-0707-1049_2.6_snapshot\amt_development\sw\src\
drivers\heci\driver\objfre_wxp_x86\i386\HECI.pdb – file not found
CallBack: Finished Attempt to Load symbols for ba308000 \SystemRoot\system32\DRI
VERS\HECI.sys

DBGHELP: HECI – no symbols loaded
Starting to collect profile data

Will collect profile data for 5 seconds
===> Finished Collecting Data, Starting to Process Results

————Overall Summary:————–

P0     K 0:00:02.296 (45.9%)  U 0:00:00.343 ( 6.9%)  I 0:00:02.359 (47.2%)  DPC
0:00:00.000 ( 0.0%)  Interrupt 0:00:01.359 (27.2%)
       Interrupts= 115350, Interrupt Rate= 23070/sec.

P1     K 0:00:02.203 (44.1%)  U 0:00:00.531 (10.6%)  I 0:00:02.265 (45.3%)  DPC
0:00:00.015 ( 0.3%)  Interrupt 0:00:01.093 (21.9%)
       Interrupts= 110550, Interrupt Rate= 22110/sec.

TOTAL  K 0:00:04.500 (45.0%)  U 0:00:00.875 ( 8.8%)  I 0:00:04.625 (46.3%)  DPC
0:00:00.015 ( 0.2%)  Interrupt 0:00:02.453 (24.5%)
       Total Interrupts= 225900, Total Interrupt Rate= 45180/sec.

Total Profile Time = 5000 msec

                                       BytesStart          BytesStop         Byt
esDiff.
    Available Physical Memory   ,       116039680,       114077696,        -1961
984
    Available Pagefile(s)       ,      1642790912,      1640251392,        -2539
520
    Available Virtual           ,      2132402176,      2105364480,       -27037
696
    Available Extended Virtual  ,               0,               0,
  0

                                  Total      Avg. Rate
    Context Switches     ,        91187,         18237/sec.
    System Calls         ,       270278,         54056/sec.
    Page Faults          ,         6482,         1296/sec.
    I/O Read Operations  ,          269,         54/sec.
    I/O Write Operations ,           41,         8/sec.
    I/O Other Operations ,         5361,         1072/sec.
    I/O Read Bytes       ,      1551742,         5769/ I/O
    I/O Write Bytes      ,         2836,         69/ I/O
    I/O Other Bytes      ,      2381876,         444/ I/O

—————————–
Results for Kernel Mode:
—————————–

OutputResults: KernelModuleCount = 227
Percentage in the following table is based on the Total Hits for the Kernel

Time   1754 hits, 25000 events per hit ——–
Module                                Hits   msec  %Total  Events/Sec
ntkrnlpa                               1273       5000    72 %     6365000
hal                                     222       5000    12 %     1110000
intelppm                                 70       5000     3 %      350000
HECI                                     47       5000     2 %      235000
igxpmp32                                 34       5000     1 %      170000
win32k                                   30       5000     1 %      150000
mfehidk                                  14       5000     0 %       70000
HIPK                                     12       5000     0 %       60000
KSecDD                                   12       5000     0 %       60000
……

===> Processing Zoomed Module HECI.sys…

Symbol Enumeration failed (or no symbols found) on module HECI in CreateZoomedModuleList, Error Code= 0x000001e7
No Hits or No symbols found for module HECI.sys
================================= END OF RUN ==================================
============================== NORMAL END OF RUN ==============================

 

Step 3: Analysis & Research

In my case, it appears that there’s some process (or processes) that’s calling the RtlIpv6StringToAddressA kernel API an excessive number of times.  From what I’ve seen in my research, not only is this a common complaint, but it’s been reported by at least two other T61 users in the last year or so – without any definitive solution.

And what’s this RtlIpv6StringToAddressA API do exactly?  Well, it translates a string form of an IPv6 address to a more strongly-typed form of the address.  Here’s what I know (and what I don’t) about IPv6 in my T61 and Windows XP SP3:

  • The “teredo” driver hasn’t been installed in Windows XP SP3 (“teredo” is the code name used by the Microsoft Windows team that developed the IPv6 stack)
  • There are a number of third-party packages on my system, but the newest to have been introduced are:
    • the AMT 2.6-supporting software and driver (v “2.6.30.1014 built by: WinDDK”, driver date 7/12/2007)
    • the PGP whole disk encryption package
    • the Intel 4965AGN wireless driver (v12.4.0.21), installed earlier
    • the Intel PROSet/Wireless Wifi Software (v12.1.1.0), installed earlier
  • Knowing that AMT is meant to provide a remote networking interface to Intel chipset functionality, and knowing that this problem didn’t occur before the AMT & PGP installs took place, my primary suspect (as the source of these excessive IPv6-related API calls) becomes the AMT software.
  • However, I consulted with colleagues on my team that are intimately familiar with the internals of HECI.sys, and they have confirmed that it doesn’t call the RtlIpv6StringToAddressA() Win32 API
  • That prompted me to search the strings embedded in all the drivers and libraries I could find in C:\Windows\System32.  Here were the files where FINDSTR.EXE could find that API call:
    • dnsapi.dll
    • mswsock.dll
    • ntdll.dll
    • ntkrnlpa.exe
    • ntoskrnl.exe
    • Drivers\http.sys

Disabled the following Devices (which didn’t help):

  • Ricoh Memory Stick Controller
  • Ricoh SD/MMC Host Controller
  • Ricoh xD-Picture Card Controller
  • OHCI Compliant IEEE 1394 Host Controller
  • ThinkPad Modem
  • Ricoh R/RL/5C476(II) or Compatible CardBus Controller
  • SDA Standard Compliant SD Host Controller

Downloaded and installed the following driver updates:

  • bluetooth v5.5.09.6400
  • Ultranav v11.1.21.2
  • Power Manager v1.60
  • ACPI Power Management v1.53
  • Integrated camera v5.8.49004.0
  • Hotkey driver v2.20.0001

Step 4: Make sure you try later and earlier driver versions

Bottom line: I replaced the network driver on my system with one that appeared *months* after I started down this road, and I finally stopped seeing these CPU spews occur.

I was lucky – the vendor dropped a new driver version that happened to resolve this error.  Until that point, I was stuck – nothing I could do to my OS to resolve it.

And if your vendor hasn’t released a *later* driver that eliminates the issue, I’ll recommend something radical – try an *earlier* version of the driver (e.g. like these folks here).  Sometimes programmers introduce errors that didn’t exist in previous versions of their code, even if they didn’t intend to affect such a broad issue.  (Especially when they didn’t intend to.)  You might find a 6-, 12- or 24-month-old driver that still works well enough (for the time being) and doesn’t exhibit the performance drag you’re seeing.  What’s worse – an older driver that works but doesn’t have the latest features, or a driver that has the features but also craters your system performance?

Last resort: you may find that disabling the device in the OS will finally shut up a misbehaving driver (i.e. that the driver won’t load, so it won’t spawn all these excessive interrupts).  This isn’t a great long-term solution, and it’s only viable for hardware you don’t need to use (e.g. a biometric device such as fingerprint reader), but it’s one last thing to consider before junking your whole system.

Notes: related articles

Sysinternals Forums

  1. Hardware interrupts using excessive CPU
  2. Kernrate trace + RtlIpv6StringToAddressA (?)

Lenovo Forums

  1. x200 Hardware Interrupts constantly at 40% of CPU resources
  2. X200: hardware interrupts causes cpu load?
  3. T61: Problems with Hardware Interrupts when on battery
  4. Problem on S12 with flood of interrupts from Broadcom wireless-G after Windows resume

ComponentChk.cpp – I resolved my first C++ compiler "fatal error"!

Yes, for those of you with even more grey in your beard/hair than I, this is probably a “meh” event.  However, for me, who’s always looked at C++ code and wondered how the H*** anyone could ever make sense of it, this is a big event.  [Patting myself on the back.]

Situation

I’m working on finishing up the Setup bootstrapper for my VSTO add-in Word2MediaWiki.NET.  I’m targeting Word 2003 and Word 2007 (despite Microsoft’s best efforts/worst neglect to the contrary), and I’m trying to achieve what many have done before, but which seemed impossible for me: allow the Setup.exe to detect which version of Word is installed, and then conditionally install the prerequisite software (i.e. the so-called Office PIAs) for whichever version of Word is discovered.

Problem 1: most folks in the VSTO space seem to think that, despite the fact that a version of the .NET framework is required to support VSTO, the Setup.exe that goes with your VSTO add-in install package should use UNmanaged code to detect whether the pre-requisites are installed.

Problem 2: I know squat about writing unmanaged code.

Problem 3: Microsoft’s VSTO & Office teams left it up as an exercise to the VSTO app developer to figure out how to assemble the amazing number of parts necessary to make a VSTO install work transparently to the end user.

Problem 4: There’ve been many articles written posthumously (I mean, long after VSTO v2 was released) on various aspects of automating VSTO-add-in installation, but none of them addressed the very inevitable scenario where the app developer needs to support both Word 2003 and Word 2007.  [Don’t even *think* about pre-2003 versions of Office — you’d have to be clinically insane.]

Manna from heaven

One ridiculously brave soul, Mark Hogan, actually took the time to not only figure out how to build such a conditional pre-requisite detection app in C++, but he was so overcome with glee that he beat Microsoft at something even they were too scared to do, that he published the full set of source code and XML configuration files for the entire world to use.

Now, masochist that I am, I took it upon myself to try to integrate the code that Mark Hogan published into the existing code that I’d already slotted into my Office PIA bootstrapper files.  However, I didn’t anticipate the foreseeable result that, because I’m coding this stuff in my spare time, and usually late at night, I would (a) not finish the integration work in one sitting, (b) forget what I’d originally set out to do and (c) forget to integrate any of the critical code that actually performed the conditional logic.

Mike chases his tail

Miraculously, I was left with a .CPP file that would compile and that appeared to be significantly different from the original file I started from, and that threw off an error code that created a spectacular wild goose-chase:

“Unable to satisfy all prerequisites for Word2MediaWikiDotNET.  Setup cannot continue until all system components have been successfully installed.”

Details:

“Prerequisite check for system component Microsoft Office 2003/2007 Primary Interop Assemblies (Word Only) failed.

See the setup log file located at ‘C:\DOCUME~1\msmithlo\LOCALS~1\Temp\VSD33.tmp\install.log’ for more information.”

And in the INSTALL.LOG file was the illusory answer:

Running external check with command ‘C:\DOCUME~1\msmithlo\LOCALS~1\Temp\VSD33.tmp\Office2003PIAor2007_WordOnly\ComponentChk.exe’ and parameters ‘/delete /save /word {1EBDE4BC-9A51-4630-B541-2561FA45CCC5}’

“Process exited with code 1607”

Setting value ‘1607 {int}’ for property ‘SupportedWordVersionInstalled’

Setting value ‘1607 {int}’ for property ‘PIAsRegistered’

This led me down the path of chasing InstallShield errors, since the only Google search results that looked at all related were things like these.  After a few days of trying to figure out how the InstallShield scripting engine could be related to a Visual Studio SETUP.EXE or custom C++ application, I finally got my brain back and tried to isolate where in the code or configuration files the problem existed.  That led me to a line of C++ code that threw ERROR_UNKNOWN_COMPONENT, which as it turns out also shares the 1607 error/exit value.

[The whole gruesome story is illustrated in the Bug I filed to myself here.]

Back to the original goal, with a new mission

Once I realized what I’d left out of the C++ code, I quickly got it to the point where I could try compiling it.  Now a new mission emerged – how to debug a C++ compiler error?

C:\>cl.exe /Oxs /MT /GS ComponentChk.cpp advapi32.lib
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.21022.08 for 80×86
Copyright (C) Microsoft Corporation.  All rights reserved.

ComponentChk.cpp
Microsoft (R) Incremental Linker Version 9.00.21022.08
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:ComponentChk.exe
ComponentChk.obj
advapi32.lib
ComponentChk.obj : error LNK2019: unresolved external symbol __imp__WaitForInputIdle@8 referenced in function “int __cdecl MyStart(char const *,char const *)” (?MyStart@@YAHPBD0@Z)
ComponentChk.exe : fatal error LNK1120: 1 unresolved externals

That command line (cl.exe /Oxs /MT /GS ComponentChk.cpp advapi32.lib) was derived from the original article from which we were all working (or stumbling around half-blindly, it felt to me).  I just ran this without a second thought, assuming that since Mark Hogan didn’t seem to mention any modifications to it, any errors it showed must’ve been my fault.

Where is “__imp__WaitForInputIdle@8” I wondered?  It didn’t show up in the source code when I searched for it, and nor did “int __cdecl MyStart“.

After staring at this for a while longer, I figured some substring searches would work, which after a few attempts finally showed me that I was actually looking for the code

WaitForInputIdle(pi.hProcess, INFINITE);

which is called in the function MyStart().  I tried some silly things first, of course, but I eventually realized that if WaitForInputIdle() didn’t exist as a named function in the source code, perhaps it existed in another library that wasn’t yet connected to the code?  A quick MSDN Library search told me this function was actually part of USER32.DLL, and it wasn’t too painful a leap of logic to try adding USER32.LIB to the compilation parameters, like so:

cl.exe /Oxs /MT /GS ComponentChk.cpp advapi32.lib user32.lib

And when I loaded up the now-successfully-compiled COMPONENTCHK.EXE into DEPENDS.EXE, it confirmed exactly what I expected to see:

image

Mark, many thanks to you, and my apologies for calling into question your mad C++ skillz.

Debugging a Word 2003 runaway thread…but not successfully

I just experienced one of the usual “hangs” in Microsoft Word 2003 that happen pretty regularly when working on multiple, large documents for any significant length of time.  The WINWORD.EXE process is taking up 50% of my CPU (which as a dual-core processor, means that there’s a thread that’s somehow taking up 100% of the logical CPU for which it’s scheduled), and has been doing this for at least ten minutes now with no letup.

In my experience, these “runaway consumers of CPU cycles” just never quiesce — eventually I have to decide to kill WINWORD.EXE from Task Manager or Process Explorer, or else the offending process will consume that “CPU” from now until the end of time.

Maybe I was just bored today, ’cause rather than just kill the runaway process, I decided to see if I could dig a little deeper.  [I think Mark Russinovich has infected me with the idea that these are surmountable problems — though I wouldn’t dream of trying to make a favourable comparison between my haphazard hacking and Mark’s mad skillz.]

Process Explorer

Let’s have a look at a few screenshots, shall we?

image 
(Performance stats, in case that’s useful to anyone — though it doesn’t provide me any telling evidence)

image
(Listing of the threads currently instantiated in WINWORD.EXE including the main thread, which is the one causing all the problems)

image
(Stack contents for the WINWORD.EXE thread)

image
(Stack contents for GdiPlus.DLL thread, which was the only other thread with any activity under the “CSwitch Delta” heading)

Process Monitor

Once I decided to investigate, I fired up Process Monitor and limited it to WINWORD.EXE.  The activity logged is almost entirely like this:

image

Don’t strain your eyes too badly on this — I’ve included this just to note the incessant nature of the major activity here: a rapidly-repeating WriteFile operation on a single Temporary file (~WRS1954.tmp), interrupted once in a while by a smaller (Length of anywhere between 512 and 3072) ReadFile operation on the same file:

image

Interestingly, these ReadFile operations occur in an irregular but repeating pattern:

image

Also of note is the fact that this temporary file is constantly growing in size, and not just temporarily swelling the data stored within a pre-allocated file — I confirmed that by right-clicking on the Path in Process Monitor, chose “Jump to Location…” and simply Refreshed on the folder to observe the reported file Size was incrementing every time (over a span of 50 minutes, it grew by approx. 222 Kb, or 233657856 bytes).

If I look closer at the Handles for WINWORD.EXE, I notice that this is one of many Temporary files open by Word, which implies that the problem we’re experiencing is very specific to one type of unexpected activity (and not just affecting Word’s handling of Temporary files):
image
(Note: I intentionally “hashed” out the original filename, which is the last entry in the list pictured.)

One other piece of information: I tried to resize the Window in which the active document was being displayed.  Word appended “(Not Responding)” to its Title Bar, and that seems to have changed the behaviour profile of the WINWORD.EXE thread.  Since that point in time, Process Monitor did not record any further increase in the size of the ~WRS1954.tmp file, but recorded one additional ReadFile operation on the WINWORD.EXE file itself (Offset: 3998720, Length: 4096).  [WINWORD.EXE File version = 11.0.8169.0, Digital signature timestamp = May 31, 2007 12:38:03 PM]

Finally, I grabbed a full memory dump of the WINWORD.EXE process, using windbg.exe and the .dump /ma command.  I can’t say I know much about debugging a dump file, but I’ve got it on the off-chance that I ever find a good guide to debugging.

What Caused This?

Three circumstances I think contributed to this, though in my opinion none of them should lead to hung process (since I’ve done this more often without incident):

  1. I had opened a Word 2003 document directly from Outlook (it was attached to an email).
  2. The document had Track Changes enabled, and I’d already added Comments throughout the document.
  3. In the Comment I was just editing, it had scrolled off screen…
    image
    …and I had just attempted to apply formatting (I’d typed [Ctrl]-B and [Ctrl]-I rapidly,to bold and italicize) to a single word in the Track Changes panel below the document (the one that opens automatically when you keep typing in Comments that have already “scrolled off screen”).
     image
    (Note: I intentionally redacted the confidential text — but it sure ain’t artistic)

Caveat: While my experience with Word over the years has taught me that heavy use and abuse of the Comments feature leads to instability, I’m still miffed that I’d lose the recent batch of edits just because I’d foolishly tried to emphasize my point using basic formatting in a Comment.

So What Can We Conclude So Far?

I don’t know much about reading a stack trace, so this is all guesstimation on my part (plus a little intelligence gathered from a good Russinovich article).  The WINWORD stack indicates that Word has called ntkrnlpa.exe aka the Windows kernel.  It looks like it’s basically stalled (KiDispatchInterrupt) while creating a new thread (KiThreadStartup).  Looking lower in the stack, the first caller in WINWORD is labelled only “0x1a772b” — whatever that is, it’s beyond my skills to unearth the identity of that API.

The next one down in the stack, however, is wdGetApplicationObject().  There’s no information in MSDN that references this function, though a few pages on the ‘net do allude to it (mostly in the same kinds of searches I made).  The best info I could find was here, which I’m guessing is Word’s way of getting a handle to the overall Word “application object”.  However, without any further context, it’s very hard to imagine what is really going on here.

Turning to the GdiPlus stack, it looks like another kernel call that’s stalled (many similar references to “WaitForMultipleObjects” functions), all boiling down to a call to the GdipCreateSolidFill() API.  From what MSDN documents, this seems like a pretty innocuous function, having nothing to do with temporary files, only to do with UI.  I can understand this — by the time I’d looked at the GdiPlus stack, I believe the UI had “hung” (aka it was non-responsive).  So while this thread was also active, it’s almost impossible for it to be involved in this issue.

Then the only thing I know for sure is the temp file was growing due to some runaway operation, and the runaway operation (which was probably related to an attempt to format Comment text) at some point obtained a handle to the Word application object.

I’m guessing that the only way to get any closer to the root cause would be to dig into the memory dump.  And…bugger me, the dump I grabbed ended up with this as its STACK_TEXT (from !analyze -v):

0adaffc8 7c9507a8 00000005 00000004 00000001 ntdll!DbgBreakPoint
0adafff4 00000000 00000000 00000000 00000000 ntdll!DbgUiRemoteBreakin+0x2d

Guess that’s “the wall” for me.

Converting from Microsoft Money to Quicken — finally, a documented solution!

After years of fighting and losing the battle to use Microsoft Money to stay on top of my finances, and after reading that Microsoft Money has only 1/4 of the market share that Quicken has (even after all these years of Microsoft trying to take over), I finally gave in (with a really sweet $33 deal via Costco.com) to sense and decided to switch my financial management to Quicken.

Unfortunately, as much as I would’ve figured that Quicken would want to make it dead-easy to convert MS Money data to the Quicken format, it turns out that not only is their conversion tool buggy and not well documented, but the conversion process is pretty damned fragile (and requires a lossy and manual process of exporting into an XML format).

I spent much of this weekend fighting with this software, troubleshooting all the various errors (mostly using Process Monitor — the best friend of anyone fighting with crappy, cryptic software) and finally came up with a solution that is (a) successful (unlike every report I’ve been able to find on the Internet) and (b) repeatable.

This is my gift to you, the many who have wisely determined that their foolish experiments with Microsoft’s finance software are best left behind.  If you’d like to hear the whole gory story, just keep reading.  If you’d prefer just the step-by-step recipe, just skip to the end of the article and enjoy!

First Error

CreateQWPAManager could not be found in qwutil.dll

I dug around Google for a while, which was not encouraging (it sounds like not one customer had ever found a tech support rep at Intuit that had any clue what this error message really means).

However, to confirm my suspicions I fired up DEPENDS.EXE (no, not an adult undergarment utility) and confirmed that the version of QWUTIL.DLL that ships with Quicken Deluxe 2007 does not include the CreateQWPAManager() API.

I tried to find a more up-to-date version of this file that I would’ve expected Intuit to make available for users of the Data Converter utility, but so far they have nothing there.  So I fired up my favourite torrent utility and “borrowed” a copy of Quicken 2008 Home & Business (which at the time was the only version of Quicken 2008 that was being “shared”).

In the DISK1 folder of the CD image there’s a file called DATA1.CAB.  Opening that up with Winzip, I was able to extract a copy of QWUTIL.DLL.

I then renamed the Quicken 2007 version, copied the Quicken 2008 version to the C:\Program Files\Quicken folder, and confirmed with DEPENDS.EXE that this 2008 version of QWUTIL.DLL indeed does include the CreateQWPAManager() API.

[Now, because the Data Converter utility loads QWUTIL.DLL directly from the Quicken program directory, it isn’t possible to just copy the QWUTIL.DLL into the C:\Program Files\Data Converter folder (which is what would normally work with most Windows apps).  Instead we have to “replace” the old file with the new one in the same location (and then restore the old file once we’re done).]

Second Error

The data converter can not find the Microsoft Money Account Transactions*.xml report.
Try re-creating the Account Transactions*.xml report.

I fired up Process Monitor, and spotted the MNY2QDF.EXE process failing to QueryDirectory for C:\Documents and Settings\mikesl\Desktop\Investment Transactions*.xml and :\Documents and Settings\mikesl\Desktop\Account Transactions*.xml.

So after a bit of head-scratching, I renamed the report XML files to “Investment Transactions.xml” and “Account Transactions.xml”.

Third Error

The data converter does not recognize the report as a Microsoft Money Account Transactions report.
This could be because the report was renamed or because it is not the correct report.

Edited the contents of the element in the XML file(s) to replace the unexpected filename (e.g. “Quicken export”) with the required filename (without the .xml suffix):

Filename

XML Tag

Investment Transactions.xml Investment Transactions
Account Transactions.xml Account Transactions

Note that this error can be caused by problems with either the Account Transactions or the Investment Transactions report.

Fourth Error

The Microsoft Money Investment Transactions report must include certain fields.
Make sure that the Investment Transactions report includes the Account, Date, Investment, Activity, Shares, Price and Amount fields, and create it again.

I temporarily renamed the “Investment Transactions.xml” file and Data Converter was able to complete its work on Account Transactions.xml”.

<Edit> A little more detail… This error message was caused by the simple fact that I didn’t include all of the fields it mentions here.  At first I was confused by this error because one of the many guides on the Quicken site spelled out exactly which checkboxes to include, and (I checked afterwards) it definitely missed one of the fields, which I dutifully did not check.  Later, once I re-ran the report with all the fields included, the data converter successfully converted the Investment Transactions report as expected.

(koty, a reader, asked me to clarify what I meant by “temporarily renamed” — it probably would’ve been less confusing if I’d said I temporarily removed the file so that the data converter would only focus on the Account Transactions file.) </Edit>

Fifth Error

The converter tried to launch Quicken 2007 automatically once it completed the conversion, but Quicken 2007 immediately stops when the 2008 QWUTIL.DLL is present.

While I’d hoped that Quicken 2007 could operate properly with the 2008 version of this file, I had to remove the 2008 version and replace it with the 2007 version of QWUTIL.DLL.

Then I had to locate where the converter had dumped the converted Accounts data.  I’d assumed it would copy the results into the currently-configured Quicken data file (the one I’d created when I figured launched Quicken 2007), but that was still empty.   It turns out that, even though ‘d made a point of storing the Quicken Data file that I’d created originally in a non-default location, the converter didn’t find that file, nor try to put its conversion results in the same folder.  Instead, it created a file QDATAMNY.QDF in My Documents\Quicken.

Sixth Error

When I asked Quicken 2007 Deluxe to open the QDATAMNY.QDF file, it gave me this warning:

Important
You are currently using Quicken 2007 Deluxe.
The current data file was last used with Quicken Home & Business.
To add the additional features of Quicken 2007 Home & Business again, click Unlock Again.

I chose Unlock Later.

This is the result of me temporarily “borrowing” the only Quicken 2008 version I could get my hands on, and just manually swapping the QWUTIL.DLL files back & forth depending on what I needed.  [Hopefully this doesn’t cause any long-term compatibility issues with the converted data.]

Once I switched back to the original version of QWUTIL.DLL, Quicken started up fine, and was able to open the QDATAMNY.QDF file that the Data Converter had just created.

Conclusions

  1. Intuit should be more forgiving when seeking the named reports on the Desktop — it’s hard-coded to look for the default names of those reports, and while that would work in most cases, I was one of those crazies that renamed the reports (in the “Title” field of the Customize Report UI).  Without any documentation that the Data Converter only looks for the default Report names, it was not at all easy for me to figure out the source of the Second Error.
  2. Intuit needs to clearly test and document the dependencies for the Data Converter and ensure that anyone who downloads the Data Converter can use it in the proscribed environment.  Currently, the version of the Data Converter that is available on from Intuit only works with the 2008 version of QWUTIL.DLL.
  3. Intuit needs to ensure that what the Data Converter requires from the MS Money Reports is what’s documented in their Help files.  For example, in this page‘s Step 4, #3, they fail to include the Account field.
  4. Intuit should be more forgiving about the naming of the MS Money Reports that it’s trying to convert.  There’s no good reason to both look for hard-coded names for the files and require the XML element inside the file be exactly the same as the file’s filename.
  5. MS Money is not required to be installed on the same computer where the Data Converter is run (Process Monitor watched MNY2QDF.EXE during startup and during the conversion, and never did I see Data Converter try to find, open or use any of the Money files or Registry settings).  However, you do have to have a copy of MS Money so you can create the specified Reports, and you need to get the reports onto the Desktop of the user running the Data Converter.

Bottom Line: Recipe for Converting MS Money transactions data to Quicken

  1. Start with the instructions here, and whenever I haven’t specified something here, do whatever the Quicken help says.
  2. Install your version of Quicken on your current computer.
  3. Download the Microsoft Money to Quicken Data Converter and install it on the same computer.
  4. Get a copy of QWUTIL.DLL from any Quicken 2008 edition (if you can’t find one elsewhere, you should be able to find a Quicken 2008 Home & Business version here).
  5. Rename the installed Quicken’s QWUTIL.DLL to QWUTIL.DLL.ORIGINAL (by default, it’s found under C:\Program Files\Quicken\qwutil.dll).
  6. Copy the 2008 version of QWUTIL.DLL to the folder where you renamed the original one (e.g. copy it to C:\Program Files\Quicken).
  7. Follow the instructions for creating the Account Transactions Report and Investment Transactions Report (e.g. Step 3 and Step 4 from the online Help).  Ensure you preserve the default Title of each report.  If the XML file(s) on the Desktop are not named Account Transactions.xml and/or Investment Transactions.xml, then rename the resulting XML file(s) found on your Desktop to Account Transactions.xml and/or Investment Transactions.xml and edit these files’ XML to match the settings in the table above.
  8. Click the “Import Into Quicken” button in Data Converter.  You’ll eventually see a flash of the Quicken splash screen, then you’ll notice that neither the Data Converter nor Quicken will not be running.
  9. Delete the 2008 version of QWUTIL.DLL from the Quicken folder, and rename QWUTIL.DLL.ORIGINAL back to qwutil.dll.
  10. Launch Quicken, open the My Documents\Quicken\QDATAMNY.QDF file and party on with your old MS Money data!  [Oh, and I believe you can safely ignore any warning about the Quicken data file having been opened last time with a version of Quicken you don’t have installed; that’s a one-time warning that is the result of this workaround.]

MyPicasaPictures Part 5: Hacking the XP development environment for Vista Media Center applications

So I’m going through the Windows Media Center Application Step by Step guide (which incidentally, is in the XPS format — Microsoft’s pretender to the PDF throne, and a risky choice for any application to choose ahead of the widespread adoption of the underlying platform).  I’ve gotten as far as to add the References to the MediaCenter assemblies, when I realize (for the first time) that developing a Media Center application might very well mean that I have to have the development environment installed on a Media Center PC.

This seems like an odd dependency to have, considering all the other types of development projects that don’t necessarily require you to run your IDE on the target platform.  So I’m wondering if there’s a way to either (a) just drop the assembly files (i.e. the DLLs) into the expected directory and just reference them & go, or (b) I’ll have to figure out how to register those assemblies on my non-Media Center system.

On pages 5 and 6 of the Step by Step guide the instructions indicate to add references to the Microsoft.MediaCenter.dll and Microsoft.MediaCenter.UI.dll assemblies.  While it assumes you’re working from a Vista Premium/Ultimate machine and have those assemblies in the %SYSTEMROOT%\ehome folder, I found that copying those files from the VMC box and browsing to their copied location on my Windows XP SP2 system seemed to resolve the references just fine.  [I’m not convinced there won’t be other issues later on, but it’s at least worth a try.]

What’s The Sample Code Doing?

The code going into Application.cs looks interesting, but without any Comments to help us understand the purpose of each line, it’s not very instructive.  For me, this is all I’m able to infer:

  • we’re creating an Application class (though I don’t know what VMC developers would think of as an “application” — an add-in?  a piece of UI?  an assembly?  the whole Media Center process and children?)
  • we’re creating three Properties, both a private instance and its public instance (though I don’t really have any specific idea what these properties are meant to do)
  • the Application “object” is overloaded, but I’m not even sure if it’s a Property or something else
  • there are references to this, but I don’t really know what the “this” is referencing — is it the Application class, or is it something else?  [It doesn’t help that I’m not an expert at coding, so I don’t intuitively recognize what each of these things are, but it sure would help to reinforce my tentative steps into this if I had Comments that told me what I was looking at.]

I’m also puzzled by why this code doesn’t follow the Design Guidelines for Managed Class Developers, when nearly all managed code published by Microsoft adheres to these standards.  For example,

public void DialogTest(string strClickedText)

does not need “str” to prefix the ClickedText variable (aka Hungarian notation) — Visual Studio will always indicate the datatype with tooltips.

As another example, the private member variables

            int timeout = 5;
            bool modal = true;
            string caption = Resources.DialogCaption;

are all named using Camel case but without any prefix (such as a leading “_” character).  I understand that the Design Guidelines don’t specifically call out prefixing rules for member (? or private ?) variables, but apparently the guidelines do recommend the use of “this.” to prefix instance variables (which is awful confusing, as you can see above).  Personally I’d prefer to see “_” or “m_” prefixes on private member variables, but I’d at least like to see *some* consistency in the use (or non-use) of prefixes on variables.

While prefixing is a religious fight, the use of Hungarian is not — it’s clearly not recommended.

Compile Errors

I followed the Step by Step through to Build the Solution, at which point I got back one error in the compiler, at Line 55 of my code:

‘MyProject.Resources’ does not contain a definition for ‘DialogCaption’

If I followed the code adequately, the Step by Step has a bug (or at least an ambiguity) in the step Add Resources.resx/Add Strings.  Whereas the Step by Step guide actually wanted me to create one String called “DialogCaption” in the Resources file, I understood it to instruct me to create two Strings — one called “Name” and the other called “Value”:
Snippy0002

It seems pretty obvious now, but at the time my interpretation seemed intuitive.

 

After resolving that more trivial issue, the next Build confronted me with this error:

The command “%windir%\eHome\McmlVerifier.exe -verbose -assemblyredirect:”C:\VS2005 Projects\MyProject\MyProject\bin\Debug” -directory:”C:\VS2005 Projects\MyProject\MyProject\Markup”” exited with code 9009.

This turned out to be harder (and dumber) to diagnose.  There were no specific errors related to McmlVerifier and 9009, and the first real lead from Google referred to path quoting issues.  I finally realized that the McmlVerifier.exe file didn’t even exist on my XP (development) system.  Strangely though, it didn’t exist on the VMC system either – is this one of the SDK tools?  Yes.  However, for some reason the SDK didn’t install it into the %windir%\ehome directory.

Once I copied McmlVerifier.exe to the ehome directory, the Build completed successfully.

Strangely, most of the steps under Enable UI Testing and MCML Verification (steps 7-17) were redundant – they were already completed on my behalf.

Debugging

When I tried Debugging the solution, I got yet another error:
image

System.DllNotFoundException was unhandled

Unable to load DLL “EhUI.dll”: The specified module could not be found.
(Exception from HRESULT: 0x8007007E)

However, this time when I merely copied EhUI.dll to the %windir%\ehome directory, the Debugger threw another exception, slightly more impenetrable:
 image

System.DllNotFoundException was unhandled

Unable to load DLL “EhUI.dll”: The specified procedure could not be found.
(Exception from HRESULT: 0x8007007F)

The stack trace for this was:

at Microsoft.MediaCenter.Interop.RenderApi.SpWrapBufferProc(ProcessBufferProc pfnProcessBufferProc, IntPtr* ppNativeProc)\r\n
at Microsoft.MediaCenter.Interop.RenderApi.InitArgs..ctor(MessageCookieLayout layout, ContextID idContextNew, ProcessBufferProc pfnProcessBufferProc)\r\n
at Microsoft.MediaCenter.UI.MessagingSession..ctor(UiSession parentSession, ContextID idLocalContext, TimeoutHandler handlerTimeout, UInt32 nTimeoutSec)\r\n
at Microsoft.MediaCenter.UI.UiSession..ctor(ContextID idLocalContext, IServiceProvider parentProvider, RenderingInfo renderingInfo, EventHandler rendererConnectedCallback, TimeoutHandler handlerTimeout, UInt32 nTimeoutSec)\r\n
at Microsoft.MediaCenter.UI.UiSession..ctor(RenderingInfo renderingInfo)\r\n
at Microsoft.MediaCenter.Tools.StandAlone.Startup(String[] args)\r\n
at Microsoft.MediaCenter.Tools.StandAlone.Startup()\r\n
at Microsoft.MediaCenter.Tools.McmlPad.Main(String[] args)

Given that I’ve got the Vista version of EHUI.DLL in the right place, I assumed I’d have to “register” it (I forget the equivalent in the .NET Framework world) so that its procedures could be “found”.  However, before going down a “.NET theory” path I decided to google the error message.  The first five or so forum posts that came back pointed finally clued me in that in fact this may represent a second-order dependency in another missing DLL.  With that, I decided to just copy in the entire contents of my VMC’s %WINDIR%\eHome directory and retry debugging.

Debugging exponentially: Process Monitor, Dependency Walker

That wasn’t any more successful, so next I fired up Process Monitor and watched for any “NOT FOUND” errors that followed the EHUI.DLL load event (hoping that this was merely a case of a missing filesystem reference, and not something more sinister).  That helped me discover the following errors (presumably, as is often the case in these situations, mostly a list of false negatives):

  • McmlPad.exe called RegCreateKey() for HKLM\Software\Microsoft\Fusion\GACChangeNotification\Default
  • McmlPad.exe failed to find OLE32.DLL under C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727, and oddly didn’t try searching the environment’s PATH locations
  • devenv.exe searched multiple times for C:\temp\symbols.flat.txt and McmlPad.pdb (symbols) under C:\temp\symbols\McmlPad.pdb\6B1042D64F29479FA1C07939AE072D941\, C:\Windows\symbols\exe, C:\Windows\exe, C:\Windows
  • McmlPad.exe failed to find C:\WINDOWS\assembly\GAC\PublisherPolicy.tme (and didn’t look anywhere else)
  • McmlPad.exe tried looking for a couple of files in the GAC search paths before it went looking directly for the file in C:\Windows\ehome:
    • Microsoft.MediaCenter\6.0.6000.0__31bf3856ad364e35
    • Microsoft.MediaCenter.UI\6.0.6000.0__31bf3856ad364e35
  • Once McmlPad.exe successfully loaded EHUI.DLL from C:\Windows\ehome, it (successfully) opened these files:
    • C:\WINDOWS\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\sorttbls.nlp
    • C:\WINDOWS\assembly\GAC_32\mscorlib\2.0.0.0__b77a5c561934e089\sortkey.nlp
  • Then devenv.exe successfully loaded C:\Program Files\Microsoft Visual Studio 8\Common7\Packages\Debugger\cscompee.dll
  • Then McmlPad.exe loaded C:\WINDOWS\assembly\NativeImages_v2.0.50727_32\System.Configuration\eee9b48577689e92db5a7b5c5de98d9b\System.Configuration.ni.dll

Hmmm, I’m not sure I learned much from that.  It looks like the application being debugged (McmlPad.exe) was looking for some GAC registration info and a few obscure files.  However, it’s likely that this is expected behaviour even on a working system.

So I went back to my google search results which convinced me to try DEPENDS.EXE to see what it would say.  I expected nothing, frankly, but that gave me two new leads I wouldn’t have otherwise found: it indicates that EHUI.DLL is looking for DWMAPI.DLL and DXGI.DLL, neither of which were present anywhere on my system.

Fascinating – one of the first Forum posts I found referencing DWMAPI.DLL indicates this file may not be needed when “linked” by a multi-OS-targeting application.  However, I suspect that for VMC libraries like EHUI.DLL, these two missing DLLs are not “load on demand” libraries – they’re just libraries that don’t appear on non-Vista platforms.

Once I grabbed copies of these two files from my Vista machine and dropped them into the %windir%\ehome folder, DEPENDS.EXE warned me that D3D10.DLL and NUCLEUS.DLL were required for DXGI.DLL, and that EHTRACE.DLL (a demand-load library) was missing.  Okey-dokey… and then I’m “warned” that even with all these files in place…:

Error: At least one module has an unresolved import due to a missing export function in an implicitly dependent module.
Warning: At least one module has an unresolved import due to a missing export function in a delay-load dependent module.

Feeling lucky despite all evidence to the contrary, I re-ran MyProject from Visual Studio, but no love was had – still and all, “the specified procedure could not be found”.  OK, let’s continue…

The unresolved imports were flagged in the following libraries on my XP development PC:

Library

Unresolved Import(s)

MSVCRT.DLL _except_handler4_common
_ftol2
_ftol2_sse
D3D9.DLL Direct3DCreate9Ex
USER32.DLL IsThreadDesktopComposited
ADVAPI32.DLL RegGetValueW
MPR.DLL WNetRestoreConnectionA

So, the obvious question is: if I copy *these* libraries from my VMC computer to C:\windows\ehome on my development XP computer, will that suffice to allow McmlPad.exe + Visual Studio 2005 to successfully debug my VMC app?

And the answer is: not quite yet.  Unfortunately, when loading EHUI.DLL, Windows will still end up loading the standard libraries (such as MSVCRT.DLL) from the PATH (i.e. %windir%\system32).  I realize that Windows is just trying to behave nicely, not being “tricked” into loading rogue versions of already-installed libraries, so I am not too upset by this.  However, this has turned into a much more difficult problem than I first anticipated.

Files copied to my development system so far:

  • %windir%\ehome\Microsoft.MediaCenter.dll (from VMC system)
  • %windir%\ehome\Microsoft.MediaCenter.UI.dll (from VMC system)
  • %windir%\ehome\ehui.dll (from VMC system)
  • %programfiles%\Microsoft SDKs\Windows Media Center\v5.0\Tools\McmlVerifier.exe (to %windir%\ehome)
  • %programfiles%\Microsoft SDKs\Windows Media Center\v5.0\Tools\McmlPad.exe (to %windir%\ehome)
  • %windir%\system32\dwmapi.dll (from VMC system)
  • %windir%\system32\dxgi.dll (from VMC system)
  • %windir%\system32\d3d10.dll (from VMC system)
  • nucleus.dll (from I don’t remember where)
  • %windir%\system32\msvcrt.dll (from VMC system)
  • %windir%\system32\d3d9.dll (from VMC system)
  • %windir%\system32\user32.dll (from VMC system)
  • %windir%\system32\advapi32.dll (from VMC system)
  • %windir%\system32\mpr.dll (from VMC system)

  Tidbits

  • 😦 “I will say that if you are a hobbyist then MCML is not likely to be very friendly to you.”
  • 😦 “It took me a long while to create a gallery that displays a list of images from an ArrayListDataSet.”
  • 🙂 “I have persevered in the days since my last post and have managed to get a repeater/scroller working to emulate the behaviour of the ‘my music’ section of media center.”
  • 🙂http://mobilewares.spaces.live.com/ has a good tutorial on making the sliding menu at the top of the screen.”
  • 😦 “Following the rumour about Microsoft encouraging their internal teams to deprecate the use of an underscore prefix for private fields, I have decided to do the same.”  [I just got my head wrapped around the use of underscores for private fields, and now I need to unlearn that behaviour just as quickly.  At least I haven’t done too much damage with it.]
  • 🙂 “If you can’t easily work out where the variable is defined (local scope, local parameter, member variable) then your code is too complex and needs refactoring.  It’s that simple.”  [I can handle that idea, yeah.]

Debugging persistent Outlook crashing – can only go so far…

I’ve been experiencing a persistent crash in Outlook for months now – often Outlook will crash when I Send an email.  I suspect it’s related to the fact that in the main Outlook form ( Mail/Inbox) the Reply, Reply All and Forward buttons and keyboard shortcuts are inactive.  [They work from the context menu of an individual message, or if I open a message and use the buttons from the Toolbar displayed above the message itself.  Yes, very weird.]  I suspect it’s a result of an unclean uninstall of the Getting Things Done add-in for Outlook – which I used to like, but which has been supplanted by the combination of MindManager and ResultsManager (at least this week).

In any case, I’ve captured multiple .DMP files, but when I try to debug them I get very sketchy results.  I used to think it was because I don’t have access to symbols for the add-ins that I’ve installed for Outlook – which is normally where these kinds of crashes come from.  However, I’ve disabled all the add-ins that are listed in Outlook > Tools > Options > Other > Advanced Options > “Add-in Manager” & “COM Add-Ins”, and I’m still getting the same kind of crashing behaviour.  I’m still getting spotty results, which tells me I don’t even have symbols for Outlook (to map the function offsets that are listed in the dump), and I’ve been beating my head against a wall trying to figure out how to get access to them.

I’m almost positive I’ve got the Microsoft Internet Symbol Server configured correctly, and yet I continue to get errors like this:

*** ERROR: Symbol file could not be found.  Defaulted to export symbols for OUTLLIB.DLL –
OUTLLIB!OlkGetUIlangID+0xd434:

I discovered that you can debug the loading operations with the !sym noisy command.  Once I enabled this, I saw this in my .reload output:

0:000> .reload
…………………………………………………………………………………………………………………………………………………..
SYMSRV:  C:\symbols\OUTLLIB.DLL\4566283D749000\OUTLLIB.DLL not found
SYMSRV:  http://msdl.microsoft.com/download/symbols/OUTLLIB.DLL/4566283D749000/OUTLLIB.DLL not found
DBGENG:  C:\Program Files\Microsoft Office\OFFICE11\OUTLLIB.DLL – Mapped image memory
SYMSRV:  C:\symbols\outllib.pdb\0EAE667B6A73417A9D7DC2E4C81382232\outllib.pdb not found
SYMSRV:  http://msdl.microsoft.com/download/symbols/outllib.pdb/0EAE667B6A73417A9D7DC2E4C81382232/outllib.pdb not found
DBGHELP: outllib.pdb – file not found
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for OUTLLIB.DLL –
DBGHELP: OUTLLIB – export symbols

I wanted to double-check that it wasn’t just a lack of some specific version of OUTLLIB.DLL, so I browsed to http://msdl.microsoft.com/download/symbols/OUTLLIB.DLL/, and received a 404 error.  To make sure there wasn’t some subtle IIS configuration issue, I tested http://msdl.microsoft.com/download/symbols/KERNEL32.DLL (a known good library), which gave me a 403 (Forbidden) error.

That tells me that Microsoft still hasn’t published Office symbols to the Internet – even while they’re trying to push application developers to use Office as a “platform” on which to build enterprise-class applications (VSTO, VSTA).  That’s a really noticeable gap, at least to me.

In any case, this is as good a debug output as I’m able to get:

FOLLOWUP_IP:
OUTLLIB!OlkGetUIlangID+d434
301b7b82 ff506c          call    dword ptr [eax+6Ch]

SYMBOL_STACK_INDEX:  0

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: OUTLLIB

IMAGE_NAME:  OUTLLIB.DLL

DEBUG_FLR_IMAGE_TIMESTAMP:  4566283d

FAULTING_THREAD:  000014d0

DEFAULT_BUCKET_ID:  NULL_CLASS_PTR_DEREFERENCE

PRIMARY_PROBLEM_CLASS:  NULL_CLASS_PTR_DEREFERENCE

BUGCHECK_STR:  APPLICATION_FAULT_NULL_CLASS_PTR_DEREFERENCE

SYMBOL_NAME:  OUTLLIB!OlkGetUIlangID+d434

FAILURE_BUCKET_ID:  APPLICATION_FAULT_NULL_CLASS_PTR_DEREFERENCE_OUTLLIB!OlkGetUIlangID+d434

BUCKET_ID:  APPLICATION_FAULT_NULL_CLASS_PTR_DEREFERENCE_OUTLLIB!OlkGetUIlangID+d434

Unfortunately, Google Groups and the web have nothing helpful to understand what “APPLICATION_FAULT_NULL_CLASS_PTR_DEREFERENCE” even means, let alone how to fix the problem it flags.

If I ever come up with a good answer to this, I’ll be sure to post it.  In the meantime, if anyone has any clues or hints for me on what I could do to narrow this down any further (e.g. any cool or powerful commands that I don’t know about – which probably includes anything beyond .symfix, .reload, k and !analyze -v.