I've experimented with audio transcription lately, but always with big, clumsy humans. I'd happily use speech recognition software, but even today, automatic voice-to-text conversion is still flawed. Naturally, I was intrigued when Google announced they were adding voice search to their Google Mobile iPhone app.

Google's flirted with voice-to-text conversion in the past, with GOOG-411 and their Audio Indexing of political videos on YouTube. But this is the first time they're offering a web-accessible interface for speech conversion, albeit completely undocumented, so I decided to poke around a bit to see what I could find.
Over the last few hours, I've been analyzing the traffic proxied through my network, trying to reverse-engineer it to get to something usable, but I've hit my limits. I'm posting this with the hopes that someone out there can run with it and find out more.
Behind the Scenes
Here's my best guess: When you first start speaking into the microphone, the iPhone app opens a connection to Google's server, waits for you to finish talking, and then does a quick and dirty conversion into a smaller binary representation of the waveform. (And I do mean tiny. These files are between 100 and 300 bytes.) Update: these binary files aren't the audio; read the Updates section below for more.
The waveform image is generated on the phone and displayed along with a "Working" indicator and the adorable "beep-boop" sounds. In the background, the binary file is being sent as a POST request to http://www.google.com/m/appreq/gmiphone. Here's what the headers look like:
POST /m/appreq/gmiphone HTTP/1.1
User-Agent: Google/0.3.142.951 CFNetwork/339.3 Darwin/9.4.1
Content-Type: application/binary
Content-Length: 271
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Pragma: no-cache
Connection: keep-alive
Connection: keep-alive
Host: www.google.com
The response from Google is an even smaller binary attachment. This is probably just an encrypted or compressed version of the converted text, in this case the words "chicken soup." Update: these binaries are irrelevant; read the Updates section below for more.
HTTP/1.1 200 OK
Content-Type: application/binary
Content-Disposition: attachment
Date: Tue, 18 Nov 2008 13:06:53 GMT
X-Content-Type-Options: nosniff
Expires: Tue, 18 Nov 2008 13:06:53 GMT
Cache-Control: private, max-age=0
Content-Length: 114
Server: GFE/1.3
After receiving the binary response to the POST, a second request is triggered, this time a GET request to clients1.google.com with the converted voice-to-text string.
GET /complete/search?client=iphoneapp&hjson=t&types=t&spell=t&nav=2&hl=en&q=chicken%20soup HTTP/1.1
User-Agent: Google/0.3.142.951 CFNetwork/339.3 Darwin/9.4.1
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
Pragma: no-cache
Connection: keep-alive
Connection: keep-alive
Host: clients1.google.com
The response is an array of search terms in JSON format, for use in search autocompletion.
["chicken soup",[["http://www.chickensoup.com/","Chicken Soup for the Soul",5,""],["http://www.chickensoupforthepetloverssoul.com/","Chicken Soup for the Pet Lover's Soul",5,""],["chicken soup recipe","489,000 results",0,"2"],["chicken soup for the soul","1,470,000 results",0,"3"],["chicken soup dog food","462,000 results",0,"4"],["chicken soup with rice","467,000 results",0,"5"],["chicken soup diet","453,000 results",0,"6"],["chicken soup from scratch","364,000 results",0,"7"],["chicken soup for the soul quotes","398,000 results",0,"8"],["chicken soup crock pot","604,000 results",0,"9"]]]
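For anyone who wants to poke at this half of the exchange, here's a rough Python sketch that rebuilds the GET URL from the capture above and pulls apart a response of the same shape. The function names are mine, the http:// scheme is an assumption based on the Host header, and the idea that a 5 in the third slot marks a direct link while a 0 marks a query completion is only a guess from this one sample. It doesn't make any network calls, so it says nothing about whether the endpoint will answer you.

```python
import json
from urllib.parse import quote, urlencode

# Assumption: plain HTTP, inferred from the Host header in the captured request.
SUGGEST_ENDPOINT = "http://clients1.google.com/complete/search"


def build_suggest_url(query):
    """Rebuild the captured GET URL for an arbitrary query string."""
    params = [
        ("client", "iphoneapp"),
        ("hjson", "t"),
        ("types", "t"),
        ("spell", "t"),
        ("nav", "2"),
        ("hl", "en"),
        ("q", query),
    ]
    # quote_via=quote encodes spaces as %20, matching the capture.
    return SUGGEST_ENDPOINT + "?" + urlencode(params, quote_via=quote)


def split_suggestions(body):
    """Split a response body into (query, links, completions)."""
    query, entries = json.loads(body)
    links, completions = [], []
    for text, label, kind, _rank in entries:
        # Guess from the sample: kind 5 is a direct link, kind 0 a completion.
        target = links if kind == 5 else completions
        target.append((text, label))
    return query, links, completions


# Trimmed copy of the response shown above.
SAMPLE = (
    '["chicken soup",[["http://www.chickensoup.com/","Chicken Soup for the Soul",5,""],'
    '["chicken soup recipe","489,000 results",0,"2"],'
    '["chicken soup diet","453,000 results",0,"6"]]]'
)

if __name__ == "__main__":
    print(build_suggest_url("chicken soup"))
    print(split_suggestions(SAMPLE))
```

Swapping in a different query string is the only change needed to generate other URLs of the same form.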
Aaand that's as far as I can get.
Help!
Unfortunately, until I can figure out the format of the binary request and response to/from Google, playing with the voice recognition features is out of reach.
How much processing is happening on the phone, and how much on Google's servers? If it's happening remotely, in what form is the audio being transmitted and the results being returned? As Ilya points out in the comments, the response binary file is too limited to even hold the text.
Any ideas on cracking this mystery would be hugely appreciated. Anonymity for Google insiders is guaranteed!
Updates
As several commenters figured out, and as Google confirmed to me, the audio is being sent to Google's servers for voice recognition. The two binaries I posted above aren't the actual transmission; they're identical for every query, so they can be disregarded. Sorry about the red herring.
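If you're capturing traffic yourself, a quick way to spot this kind of red herring is to hash the bodies from several different queries and see whether they ever differ. This is a minimal sketch; the function name is hypothetical and the payloads are whatever bytes you've saved from your own proxy.

```python
import hashlib


def all_identical(payloads):
    """Return True if every captured payload has the same bytes."""
    digests = {hashlib.sha256(p).hexdigest() for p in payloads}
    # Zero or one distinct digest means nothing varied between captures.
    return len(digests) <= 1


if __name__ == "__main__":
    # Stand-in bytes; replace with bodies saved from different spoken queries.
    captures = [b"\x00\x01\x02", b"\x00\x01\x02", b"\x00\x01\x02"]
    print(all_identical(captures))
```

If this comes back True across queries that should obviously differ, the payload can't be carrying the speech.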
Gummi Hafsteinsson, product manager for Google's Voice Search, says, "I can confirm that we split the audio down to a smaller byte stream, which is then sent to Google for recognition, but we can't really provide any details beyond that." Responding to my request for a public API, he added, "I appreciate the suggestion to provide voice recognition as a service. Right now we have nothing to announce, but we'll take this feedback as we look at future product ideas."
Also, Chris Messina discovered some secret settings in the application's preferences file, including alternate color schemes and sound sets for "Monkey" and "Chicken." Beep-boop!
Next step: Can anyone figure out the format of the audio and spoof a request to Google? Some commenters think it's in AMR format, which makes sense.
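One cheap test of the AMR theory, sketched in Python: AMR files normally begin with a short magic string ("#!AMR" plus a newline for narrowband, "#!AMR-WB" plus a newline for wideband), so you can check a captured payload for it. The function name is mine, and a streamed payload may well leave the file header out, so a miss here doesn't rule AMR out.

```python
# Magic strings from the AMR file storage format (single-channel variants).
AMR_MAGICS = {
    b"#!AMR\n": "AMR-NB",
    b"#!AMR-WB\n": "AMR-WB",
}


def sniff_amr(payload):
    """Return the AMR variant name if the payload starts with a known magic string."""
    for magic, name in AMR_MAGICS.items():
        if payload.startswith(magic):
            return name
    # No header found; the data could still be headerless AMR frames.
    return None


if __name__ == "__main__":
    print(sniff_amr(b"#!AMR\n\x3c\x00"))
    print(sniff_amr(b"\x00\x01\x02"))
```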
