Turning an unknown audio data stream into wav or similar format

Question

I am trying to get the commentary (the caster's voice) from a Dota 2 game file. I've managed to parse the game file and select what I believe is the voice data. It is in an unfamiliar format (CSVCMsg_VoiceData), which has the following struct:
type CSVCMsg_VoiceData struct {
Client                   *int32            `protobuf:"varint,1,opt,name=client" json:"client,omitempty"`
Proximity                *bool             `protobuf:"varint,2,opt,name=proximity" json:"proximity,omitempty"`
Xuid                     *uint64           `protobuf:"fixed64,3,opt,name=xuid" json:"xuid,omitempty"`
AudibleMask              *int32            `protobuf:"varint,4,opt,name=audible_mask" json:"audible_mask,omitempty"`
VoiceData                []byte            `protobuf:"bytes,5,opt,name=voice_data" json:"voice_data,omitempty"`
Caster                   *bool             `protobuf:"varint,6,opt,name=caster" json:"caster,omitempty"`
Format                   *VoiceDataFormatT `protobuf:"varint,7,opt,name=format,enum=VoiceDataFormatT,def=1" json:"format,omitempty"`
SequenceBytes            *int32            `protobuf:"varint,8,opt,name=sequence_bytes" json:"sequence_bytes,omitempty"`
SectionNumber            *uint32           `protobuf:"varint,9,opt,name=section_number" json:"section_number,omitempty"`
UncompressedSampleOffset *uint32           `protobuf:"varint,10,opt,name=uncompressed_sample_offset" json:"uncompressed_sample_offset,omitempty"`
XXX_unrecognized         []byte            `json:"-"`

}
This seems to work when reading the data. Logically I'm probably looking for the VoiceData part of the struct when given this:
"format":0,"voice_data":"uz+ACgEAEAELgD4EQgEWAKV4mxnepfmhxKCQxAnKVNaHhKRXPIsmAH5RjXmJV0u+WTmrvgyCKxcraehjo/ZeKcFjksXQZEeOju4hLNv/MAB9KA7ww14Vc0ndYPB7dDXoXTexuxcW0Jg/diMgdH5ijWhe02Ch48KX86qJZYFyZV81AH76qCgh9AXliMdyWEgWTMbRD6xMX37WJALrXlSnxymIloSq2KGwXCcMXzQiSQIrcLVNfqdNJACCluFOIRKPmugUvsLZmnD04X0xhpAuNkwJECK4t51MBOWNWJlCAIDyZlJwWI45EPTjBB6yKyGOclu96qBV2MhFAh1d2J7WDZwe6YxOVu/BGkGcur9qTP85ZRfjANoiQxQrWvpoHFBFBy0AfX6k8XvbSwrk2nUAEP3P6kcmXORKUNKeu8HDnOUflQqtA5AkkTiun77fZrqnimIfWg==","sequence_bytes":23598094,"section_number":1,"sample_rate":16000
I'm able to pull the voice data out like so:
uz+ACgEAEAELgD4EQgEWAKV4mxnepfmhxKCQxAnKVNaHhKRXPIsmAH5RjXmJV0u+WTmrvgyCKxcraehjo/ZeKcFjksXQZEeOju4hLNv/MAB9KA7ww14Vc0ndYPB7dDXoXTexuxcW0Jg/diMgdH5ijWhe02Ch48KX86qJZYFyZV81AH76qCgh9AXliMdyWEgWTMbRD6xMX37WJALrXlSnxymIloSq2KGwXCcMXzQiSQIrcLVNfqdNJACCluFOIRKPmugUvsLZmnD04X0xhpAuNkwJECK4t51MBOWNWJlCAIDyZlJwWI45EPTjBB6yKyGOclu96qBV2MhFAh1d2J7WDZwe6YxOVu/BGkGcur9qTP85ZRfjANoiQxQrWvpoHFBFBy0AfX6k8XvbSwrk2nUAEP3P6kcmXORKUNKeu8HDnOUflQqtA5AkkTiun77fZrqnimIfWg==
However, this is where I'm hitting a bit of a wall. This data is in an unknown format. I've done some research on what the format might be and found that Steam started using the SILK codec for voice data in 2011. However, when I write this data to a file and try to open it with Opus (which I believe supports SILK), the Opus decoder tells me it can't open the file, so I'm not 100% convinced it is SILK. Recognising audio data isn't something I have a great deal of experience with, so any advice would be great.
I have noticed there's a VoiceDataFormatT part of the struct but the only definition I can find for it is this:
type VoiceDataFormatT int32

Which doesn't seem too helpful! :/
EDIT 1:
As per advice from user Ian Cook I've decoded the data from base64 into the following (as hex dump):
BB 3F 80 0A 01 00 10 01 0B 80 3E 04 42 01 16 00 A5 78 9B 19 DE A5 F9 A1 C4 A0 90 C4 09 CA 54 D6 87 84 A4 57 3C 8B 26 00 7E 51 8D 79 89 57 4B BE 59 39 AB BE 0C 82 2B 17 2B 69 E8 63 A3 F6 5E 29 C1 63 92 C5 D0 64 47 8E 8E EE 21 2C DB FF 30 00 7D 28 0E F0 C3 5E 15 73 49 DD 60 F0 7B 74 35 E8 5D 37 B1 BB 17 16 D0 98 3F 76 23 20 74 7E 62 8D 68 5E D3 60 A1 E3 C2 97 F3 AA 89 65 81 72 65 5F 35 00 7E FA A8 28 21 F4 05 E5 88 C7 72 58 48 16 4C C6 D1 0F AC 4C 5F 7E D6 24 02 EB 5E 54 A7 C7 29 88 96 84 AA D8 A1 B0 5C 27 0C 5F 34 22 49 02 2B 70 B5 4D 7E A7 4D 24 00 82 96 E1 4E 21 12 8F 9A E8 14 BE C2 D9 9A 70 F4 E1 7D 31 86 90 2E 36 4C 09 10 22 B8 B7 9D 4C 04 E5 8D 58 99 42 00 80 F2 66 52 70 58 8E 39 10 F4 E3 04 1E B2 2B 21 8E 72 5B BD EA A0 55 D8 C8 45 02 1D 5D D8 9E D6 0D 9C 1E E9 8C 4E 56 EF C1 1A 41 9C BA BF 6A 4C FF 39 65 17 E3 00 DA 22 43 14 2B 5A FA 68 1C 50 45 07 2D 00 7D 7E A4 F1 7B DB 4B 0A E4 DA 75 00 10 FD CF EA 47 26 5C E4 4A 50 D2 9E BB C1 C3 9C E5 1F 95 0A AD 03 90 24 91 38 AE 9F BE DF 66 BA A7 8A 62 1F 5A
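For reference, the base64-to-hex step can be sketched in Python roughly as follows. The helper name `to_hex_dump` is made up for illustration, and the input is only the first 24 characters of the voice_data string above, not the whole message.

```python
import base64


def to_hex_dump(b64_text: str) -> str:
    """Decode a base64 string and return its bytes as space-separated upper-case hex."""
    raw = base64.b64decode(b64_text)
    return " ".join(f"{byte:02X}" for byte in raw)


# First 24 base64 characters of the voice_data value shown earlier.
# 24 characters decode to exactly 18 bytes, so no padding is needed.
sample_prefix = "uz+ACgEAEAELgD4EQgEWAKV4"
print(to_hex_dump(sample_prefix))
```

The printed bytes should line up with the start of the hex dump above.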
I'm still at a loss as to what this information is. I've tried converting it to a wav file using ffmpeg (assuming it is PCM), but it still comes out as white noise.
EDIT 2:
So it's occurred to me that it might help if I include more samples of the data - the decoded hex of the data can be found here (each sample separated by a new line character):
pastebin
I've noticed that each one seems to start with the following hex:
BB 3F 80 0A 01 00 10 01 0B 80 3E 04
Which translates to:
»?€
�€>
I'm still at a loss as to how to convert this to audio data.
EDIT 3:
I've uploaded some more data dumps to the following pastebin (More data). It's not a full dump, as that is roughly 15 MB and pastebin crashed when I tried to paste it!
The data file is a Dota 2 demo file (extension .dem), which is a collection of protobuf messages that I parse using Go and the Manta replay parser (found here). This allows me to pull out any type of message, and I select OnCSVCMsg_VoiceData, which returns m.Audio.VoiceData of the form CSVCMsg_VoiceData (the struct shown above).
EDIT 4
Here's (finally) the link to the file with the concatenated voiceData messages.
And here's the link to the original file of protobuf messages

hairlessbear · Answer

TL;DR

- Each section n indicates a separate stream of data.
- The sequence_bytes value indicates the order that the frames should be placed in when decoding.
- The voice_data is base64-encoded.
- The decoded data is a SILK-encoded frame, but with the following exceptions:
  - The first 14 bytes
  - The last 4 bytes

To decode the data, you must do the following:

1. For each section n, order n's structs in ascending order based on the value of sequence_bytes.
2. De-base64 each struct's voice_data.
3. Extract the SILK payload from each struct (i.e. remove the first 14 bytes and the last 4 bytes) and concatenate them all together (again, this must be in order based on sequence_bytes).
4. Prepend the resulting file with #!SILK_V3 (the SILK header).

You now have a valid SILK file that can be decoded (details below).

Long version

Using the sample data you posted, the first thing I had to do was replace the final comma with a ] to make it valid JSON. I originally used shell scripts to convert the structs from JSON to SILK, but in the interest of efficiency, I re-implemented the conversion in Python.

    import json
    import base64
    import sys

    def main():
        if len(sys.argv) < 2:
            print("Usage: python3", sys.argv[0], "")
            sys.exit(1)

        with open(sys.argv[1], 'r') as infile:
            json_data = json.load(infile)

        # Create dictionary with section number as the key and list of
        # that section's structs as the value
        section_dict = {}
        for obj in json_data:
            sec_num = obj['section_number']
            if sec_num not in section_dict:
                section_dict[sec_num] = []
            section_dict[sec_num].append(obj)

        # Create SILK file for each section number stream
        for section in section_dict.keys():
            filename = f"section_{section}.slk"
            print(f"Generating SILK file {filename} for section {section}...")
            with open(filename, 'wb') as outfile:
                # SILK header
                outfile.write(b"#!SILK_V3")
                # Sort frames in ascending order based on sequence_bytes value
                for frame in sorted(section_dict[section], key=lambda x: x['sequence_bytes']):
                    decoded = base64.b64decode(frame['voice_data'])
                    # strip first 14 bytes and last 4 bytes before writing
                    outfile.write(decoded[14:-4])

    if __name__ == '__main__':
        main()

To decode SILK, I used the official SDK (that's what the decoder linked by Gordon Freeman is built on top of). The SDK can be downloaded from this link, which I found from this page. After I downloaded the SDK, I extracted it, went into the directory named SILK_SDK_SRC_FIX_v1.0.9, and ran make (I'm on Kali, but pretty much any Linux variant should be fine).

Once make completes, you're left with a couple of executables; the only one we care about is decoder. Simply run decoder on the SILK payloads generated above, and you'll get a pcm file you can do whatever you want with. For example, ./decoder section_12.slk section_12.pcm. The output file is at 22050 Hz.

Hat tip to @Gordon Freeman for pointing out that the header isn't 18 bytes like I originally suspected and that the last 4 bytes aren't part of the SILK payload.

Old shell scripts

For posterity, here's how I converted the JSON to SILK files with shell scripts.
I used the following script to extract the data, de-base64 it, and put each struct's data in its own file.

    #!/bin/bash

    # Write each decoded VoiceData to a file with the naming convention
    # <sequence_bytes>_<section_number>
    write_data () {
        filename=`echo $1 | cut -d_ -f1,2`
        data=`echo $1 | cut -d_ -f3`
        echo -n "$data" | base64 -d > $filename
    }
    export -f write_data

    # jq string interpolation needs the backslash form \( ... )
    jq -r '.[] | "\(.sequence_bytes)_\(.section_number)_\(.voice_data)"' dota2CasterParse.json | xargs -I '{}' bash -c "write_data '{}'"

I then used the following script to create a SILK file for each section:

    #!/bin/bash

    section_numbers=$(ls [0-9]*_[0-9]* | cut -d_ -f2 | sort -u)
    for section in $section_numbers; do
        output="section_${section}_voiceData.slk"
        # Start each output file with the SILK header
        echo -n '#!SILK_V3' > $output
        for i in $(ls *_${section} | sort -n); do
            # Skip the first 14 bytes and drop the last 4 (file size minus 18 bytes copied)
            dd bs=1 skip=14 count=$(($(stat -c "%s" $i)-18)) if=$i of=$output conv=notrunc oflag=append
        done
    done

Gordon Freeman · Answer

There are three types of "frame"; I guess that means three casters:
BB 3F 80 0A 01 00 10 01 0B 80 3E 04 42 01  (@ 0x0)
D8 76 DD 02 01 00 10 01 0B 80 3E 04 FA 01  (@ 0x5f0)
67 7D 11 05 01 00 10 01 0B 80 3E 04 7E 01  (@ 0x44ccf)
Example for the first one:
BB 3F 80 0A: identifier of the caster
01: number of channels (mono)
80 3E = 0x3E80 = 16000: the sample rate
42 01 = 0x0142: the size of the SILK data
After the size field, the following 0x142 (322) bytes are the SILK data.
Just prepend the SILK header #!SILK_V3:
23 21 53 49 4C 4B 5F 56 33
I use silk_v3_decoder.exe (a Python script could probably do it too):
silk_v3_decoder.exe in.hex out.pcm -Fs_API 16000
then
ffmpeg  -f s16le -ar 16000 -ac 1 -i out.pcm out.wav
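If ffmpeg is not at hand, the same PCM-to-WAV wrapping can be sketched with Python's standard `wave` module. This is a minimal sketch that assumes the decoder output is raw signed 16-bit little-endian mono PCM at 16000 Hz, matching the ffmpeg flags above; the function name `pcm_to_wav` is made up for illustration.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 16000) -> None:
    """Wrap raw 16-bit mono PCM samples in a WAV container."""
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)        # mono, as in "-ac 1"
        wav_file.setsampwidth(2)        # 2 bytes per sample, as in "-f s16le"
        wav_file.setframerate(sample_rate)  # as in "-ar 16000"
        wav_file.writeframes(pcm_bytes)
```

It would be called with the contents of out.pcm, for example `pcm_to_wav(open("out.pcm", "rb").read(), "out.wav")`.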
A frame represents a short span of time, so all the data must be concatenated (as hairlessbear said).
Note: at the end of each "frame" there are 4 bytes that could be a checksum.
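The layout described in this answer can be sketched in Python as below. The byte offsets (sample rate at bytes 9-10, size at bytes 12-13, both little-endian) are read off the hex dumps above and are an assumption, not a documented format; `VoiceFrame` and `split_frame` are names made up for illustration, and the payload in the demo is dummy zero bytes.

```python
import struct
from typing import NamedTuple


class VoiceFrame(NamedTuple):
    caster_id: bytes   # first 4 bytes, e.g. BB 3F 80 0A
    sample_rate: int   # little-endian uint16 at offset 9
    payload: bytes     # SILK data, length given by the uint16 at offset 12
    trailer: bytes     # whatever follows the payload (4 bytes, possibly a checksum)


def split_frame(raw: bytes) -> VoiceFrame:
    """Split one decoded voice_data blob into header fields, SILK payload and trailer."""
    if len(raw) < 14:
        raise ValueError("frame is shorter than the 14-byte header")
    (sample_rate,) = struct.unpack_from("<H", raw, 9)
    (size,) = struct.unpack_from("<H", raw, 12)
    payload = raw[14:14 + size]
    if len(payload) != size:
        raise ValueError("frame is shorter than its declared payload size")
    return VoiceFrame(raw[:4], sample_rate, payload, raw[14 + size:])


# Demo: the 14 header bytes of the first frame, a dummy 0x142-byte payload,
# and 4 dummy trailer bytes.
header = bytes.fromhex("BB3F800A010010010B803E044201")
frame = split_frame(header + bytes(0x142) + b"\x00\x00\x00\x00")
print(frame.sample_rate, len(frame.payload), len(frame.trailer))
```

Using the declared size instead of a fixed "drop the last 4 bytes" rule makes it easy to check that the trailer really is 4 bytes long for every frame.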
