Breaking Eggs And Making Omelettes

A blog dealing with technical multimedia matters, binary reverse engineering, and the occasional bit of video game hacking.

http://multimedia.cx/eggs/

Articles published on the site

  • Reverse Engineering Italian Literature

July 1, 2014, by Multimedia Mike, in Reverse Engineering

    Some time ago, Diego “Flameeyes” Pettenò tried his hand at reverse engineering a set of really old CD-ROMs containing even older Italian literature. The goal of this RE endeavor would be to extract the useful literature along with any structural metadata (chapters, etc.) and convert it to a more open format suitable for publication at, e.g., Project Gutenberg or Archive.org.

    Unfortunately, the structure of the data thwarted the more simplistic analysis attempts (like inspecting for blocks of textual data). This will require deeper RE techniques. Further frustrating the effort, however, is the fact that the binaries that implement the reading program are written for the now-archaic Windows 3.1 operating system.

    In pursuit of this RE goal, I recently thought of a way to glean more intelligence using DOSBox.

    Prior Work
There are 6 discs in the full set (distributed along with 6 sequential issues of a print magazine named L’Espresso). Analysis of the contents of the various discs reveals that many of the files are the same on each disc. It was straightforward to identify the set of files which are unique to each disc: they all end with the extension “LZn”, where n = 1..6 depending on the disc number. Further, the root directory of each disc has a file indicating the sequence number (1..6) of the CD. Obviously, these are the interesting targets.

    The LZ file extensions stand out to an individual skilled in the art of compression– could it be a variation of the venerable LZ compression? That’s actually unlikely because LZ — also seen as LIZ — stands for Letteratura Italiana Zanichelli (Zanichelli’s Italian Literature).

    The Unix ‘file’ command was of limited utility, unable to plausibly identify any of the files.

    Progress was stalled.

    Saying Hello To An Old Frenemy
    I have been showing this screenshot to younger coworkers to see if any of them recognize it:


DOSBox running Windows 3.1

    Not a single one has seen it before. Senior computer citizen status: Confirmed.

    I recently watched an Ancient DOS Games video about Windows 3.1 games. This episode showed Windows 3.1 running under DOSBox. I had heard this was possible but that it took a little work to get running. I had a hunch that someone else had probably already done the hard stuff so I took to the BitTorrent networks and quickly found a download that had the goods ready to go– a directory of Windows 3.1 files that just had to be dropped into a DOSBox directory and they would be ready to run.

    Aside: Running OS software procured from a BitTorrent network? Isn’t that an insane security nightmare? I’m not too worried since it effectively runs under a sandboxed virtual machine, courtesy of DOSBox. I suppose there’s the risk of trojan’d OS software infecting binaries that eventually leave the sandbox.

    Using DOSBox Like ‘strace’
    strace is a tool available on some Unix systems, including Linux, which is able to monitor the system calls that a program makes. In reverse engineering contexts, it can be useful to monitor an opaque, binary program to see the names of the files it opens and how many bytes it reads, and from which locations. I have written examples of this before (wow, almost 10 years ago to the day; now I feel old for the second time in this post).

Here’s the pitch: make DOSBox behave like strace so that it can serve as a platform for reverse engineering Windows 3.1 applications. I formed a mental model of how DOSBox operates (abstracted file system classes with methods for opening and reading files) and then jumped into the source code. Sure enough, the code was exactly as I suspected, and a few strategic print statements gave me the data I was looking for.
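
    For illustration, here is a sketch of the kind of logging that produces the traces quoted later in this post. This is not actual DOSBox code: the exact classes and methods vary by version (the local-drive file code lives in src/dos/drive_local.cpp and the CD-ROM code in drive_iso.cpp), so treat this helper, called from those classes’ Read() implementations, as a hypothetical template.

    #include <stdio.h>

    /* Hypothetical logging helper (not part of DOSBox): called from the
       file classes' Read() methods, it emits one line per read in the
       format shown in the traces below. 'pos' is the file offset at
       which the read started. */
    static void trace_read(const char *name, unsigned int requested,
                           unsigned int actually_read, unsigned long pos)
    {
        fprintf(stderr,
                "=== Read('%s'): req %u bytes; read %u bytes from pos 0x%lX\n",
                name, requested, actually_read, pos);
    }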

    Eventually, I even took to running DOSBox under the GNU Debugger (GDB). This hasn’t proven especially useful yet, but it has led to an absurd level of nesting:


    GDB runs DOSBox runs Windows 3.1

    The target application runs under Windows 3.1, which is running under DOSBox, which is running under GDB. This led to a crazy situation in which DOSBox had the mouse focus when a GDB breakpoint was triggered. At this point, DOSBox had all desktop input focus and couldn’t surrender it because it wasn’t running. I had no way to interact with the Linux desktop and had to reboot the computer. The next time, I took care to only use the keyboard to navigate the application and trigger the breakpoint and not allow DOSBox to consume the mouse focus.

    New Intelligence

By instrumenting the local file class (virtual HD files) and the ISO file class (CD-ROM files), I was able to watch which programs and dynamic libraries are loaded and which data files the code cares about. I determined that the most interesting programs are called LEGGENDO.EXE (‘reading’) and LEGGENDA.EXE (‘legend’; this has been a great Italian lesson as well as an RE puzzle). The former calls the latter, which displays this view of the data we are trying to get at:


    LIZ: Authors index

    When first run, the program takes an interest in a file called DBBIBLIO (‘database library’, I suspect):

    === Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x0
    === Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x151
    === Read('LIZ98\DBBIBLIO.LZ1'): req 337 bytes; read 337 bytes from pos 0x2A2
    [...]
    

While I was unable to sort out all of the data files in this cursory investigation, a few things became obvious. The structure of this file looked like 336-byte records; it turns out I was off by 1– the records are actually 337 bytes each. The count of records read from disc is equal to the number of items shown in the UI.
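
    As a sanity check, a tiny C program can confirm the record size by verifying that the file divides evenly into 337-byte chunks and reporting the count; this is just a sketch, since the internal layout of each record remains unknown.

    #include <stdio.h>

    #define RECORD_SIZE 337  /* deduced from the read pattern above */

    int main(int argc, char *argv[])
    {
        unsigned char record[RECORD_SIZE];
        unsigned int count = 0;
        size_t got = 0;
        FILE *f;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        while ((got = fread(record, 1, RECORD_SIZE, f)) == RECORD_SIZE)
            count++;  /* one full record per complete read */
        fclose(f);
        /* 'got' is nonzero here only if the file has trailing bytes */
        printf("%u records%s\n", count, got ? " (+ partial tail)" : "");
        return 0;
    }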

    Next, the program is interested in a few more files:

    *** isoFile(): 'DEPOSITO\BLOKCTC.LZ1', offset 0x27D6000, 2911488 bytes large
    === Read('DEPOSITO\BLOKCTC.LZ1'): req 96 bytes; read 96 bytes from pos 0x0
    *** isoFile(): 'DEPOSITO\BLOKCTX0.LZ1', offset 0x2A9D000, 17152 bytes large
    === Read('DEPOSITO\BLOKCTX0.LZ1'): req 128 bytes; read 128 bytes from pos 0x0
    === Seek('DEPOSITO\BLOKCTX0.LZ1'): seek 384 (0x180) bytes, type 0
    === Read('DEPOSITO\BLOKCTX0.LZ1'): req 256 bytes; read 256 bytes from pos 0x180
    === Seek('DEPOSITO\BLOKCTC.LZ1'): seek 1152 (0x480) bytes, type 0
    === Read('DEPOSITO\BLOKCTC.LZ1'): req 32 bytes; read 32 bytes from pos 0x480
    === Read('DEPOSITO\BLOKCTC.LZ1'): req 1504 bytes; read 1504 bytes from pos 0x4A0
    [...]
    
    

Eventually, it becomes obvious that BLOKCTC has the juicy meat: 32-byte records followed by variable-length encoded text sections. Since there is no plaintext to be found in these files, the text is either compressed, encrypted, or both. Some rough counting (the program seems to disable copy/paste, which thwarts more precise counting) indicates that the text size is larger than the data chunks being read from disc, so compression seems likely. Encryption isn’t out of the question (especially since the program deems it necessary to disable copying and pasting of this public domain literary data), and if it’s in use, that means the key is being read from one of these files.
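
    One quick way to back up that hunch (a standard trick, not something from the original investigation) is to measure the byte entropy of a suspect file: readable text sits well below 8 bits per byte, while compressed or encrypted data approaches the maximum. A minimal sketch:

    #include <stdio.h>
    #include <math.h>

    /* Shannon entropy over the whole file, in bits per byte.
       Compile with -lm. */
    int main(int argc, char *argv[])
    {
        unsigned long counts[256] = { 0 }, total = 0;
        double entropy = 0.0;
        int i, c;
        FILE *f;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        while ((c = fgetc(f)) != EOF) {
            counts[c]++;
            total++;
        }
        fclose(f);
        for (i = 0; i < 256; i++) {
            if (counts[i]) {
                double p = (double)counts[i] / total;
                entropy -= p * log2(p);
            }
        }
        printf("%.2f bits/byte over %lu bytes\n", entropy, total);
        return 0;
    }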

    Blocked On Disassembly
So I’m a bit blocked right now. I know exactly where the data lives, but it’s clear that I need to reverse engineer some binary code. The big problem is that I have no idea how to disassemble Windows 3.1 binaries. These are NE-type executable files. Disassemblers abound for MZ files (MS-DOS executables) and PE files (executables for Windows 95 and beyond). NE files get no respect. It’s difficult (but not impossible) to even find documentation about the format anymore, and the details are incomplete. It should be noted, however, that the DOSBox-as-strace method described here lends insight into how Windows 3.1 processes NE-type EXEs. You can’t get any more authoritative than that.
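
    Identifying an NE file, at least, is easy even if disassembling one isn’t: every NE executable begins with a classic MZ stub whose 32-bit field at offset 0x3C points to the “new” header, and for Windows 3.x binaries the signature bytes there are ‘N’, ‘E’. A quick checker:

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        unsigned char mz[0x40], sig[2];
        long ne_off;
        FILE *f;

        if (argc < 2 || !(f = fopen(argv[1], "rb")))
            return 1;
        if (fread(mz, 1, sizeof(mz), f) != sizeof(mz) ||
            mz[0] != 'M' || mz[1] != 'Z') {
            fclose(f);
            return 1;  /* not even an MZ-style executable */
        }
        /* the offset of the new-style header lives at 0x3C */
        ne_off = mz[0x3C] | (mz[0x3D] << 8) | (mz[0x3E] << 16) |
                 ((long)mz[0x3F] << 24);
        fseek(f, ne_off, SEEK_SET);
        if (fread(sig, 1, 2, f) == 2 && sig[0] == 'N' && sig[1] == 'E')
            printf("NE executable; NE header at offset 0x%lX\n", ne_off);
        fclose(f);
        return 0;
    }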

    So far, I have tried the freeware version of IDA Pro. Unfortunately, I haven’t been able to get the program to work on my Windows machine for a long time. Even if I could, I can’t find any evidence that it actually supports NE files (the free version specifically mentions MZ and PE, but does not mention NE or LE).

    I found an old copy of Borland’s beloved Turbo Assembler and Debugger package. It has Turbo Debugger for Windows, both regular and 32-bit versions. Unfortunately, the normal version just hangs Windows 3.1 in DOSBox. The 32-bit Turbo Debugger loads just fine but can’t load the NE file.

    I’ve also wondered if DOSBox contains any advanced features for trapping program execution and disassembling. I haven’t looked too deeply into this yet.

    Future Work
    NE files seem to be the executable format that time forgot. I have a crazy brainstorm about repacking NE files as MZ executables so that they could be taken apart with an MZ disassembler. But this will take some experimenting.

    If anyone else has any ideas about ripping open these binaries, I would appreciate hearing them.

    And I guess I shouldn’t be too surprised to learn that all the literature in this corpus is already freely available and easily downloadable anyway. But you shouldn’t be too surprised if that doesn’t discourage me from trying to crack the format that’s keeping this particular copy of the data locked up.

  • Playing With Emscripten and ASM.js

March 1, 2014, by Multimedia Mike, in General

    The last 5 years or so have provided a tremendous amount of hype about the capabilities of JavaScript. I think it really kicked off when Google announced their Chrome web browser in September, 2008 along with its V8 JS engine. This seemed to spark an arms race in JS engine performance along with much hyperbole that eventually all software could, would, and/or should be written in straight JavaScript for maximum portability and future-proofing, perhaps aided by Emscripten, a tool which magically transforms C and C++ code into JS. The latest round of rhetoric comes courtesy of something called asm.js which purports to narrow the gap between JS and native code performance.

    I haven’t been a believer, to express it charitably. But I wanted to be certain, so I set out to devise my own experiment to test modern JS performance.

    Up Front Summary
I was extremely surprised that my experiment demonstrated JS performance FAR beyond my expectations. There might be something to these claims of magnificent JS speed in numerical applications. Basically, here were my thoughts during the process:

    • There’s no way that JavaScript can come anywhere close to C performance for a numerically intensive operation; a simple experiment should demonstrate this.
    • Here’s a straightforward C program to perform a simple yet numerically intensive operation.
    • Let’s compile the C program on gcc and get some baseline performance numbers.
    • Let’s use Emscripten to convert the C program to JavaScript and run it under Chrome.
    • Ha! Pitiful JS performance, just as I expected!
    • Try the same program under Firefox, since Firefox is supposed to have some crazy optimization for asm.js code, allegedly emitted by Emscripten.
    • LOL! Firefox performs even worse than Chrome!
• Wait a minute… the Emscripten documentation mentioned using optimization levels for generating higher performance JS, so try ‘-O1’.
    • Umm… wow: Chrome’s performance increased dramatically! What about Firefox? Not only is Firefox faster than Chrome, it’s faster than the gcc-generated code!
    • As my faith in C is suddenly shaken to its core, I remembered to compile the gcc version with an explicit optimization level. The native C version pulled ahead of Firefox again, but the Firefox code is still close.
    • Aha! This is just desktop– but what about mobile? One of the leading arguments for converting everything to pure JavaScript is that such programs will magically run perfectly in mobile browsers. So I wager that this is where the experiment will fall over.
    • I proceed to try the same converted program on a variety of mobile platforms.
    • The mobile platforms perform rather admirably as well.
    • I am surprised.

    The Experiment
I wanted to run a simple, relevant, yet numerically intensive benchmark, ideally something I am already familiar with. I settled on JPEG image decoding. I also wanted to keep this simple, ideally contained in a single file, because I didn’t know how hard it might be to deal with Emscripten. I found NanoJPEG, a straightforward JPEG decoder contained in a single C file.

    I altered nanojpeg.c (to a new file called nanojpeg-static.c) such that the main() program would always load a 1920×1080 (a.k.a. 1080p) JPEG file (“bbb-1080p-title.jpg”, the Big Buck Bunny title), rather than requiring a command line argument. Then I used gettimeofday() to profile the core decoding function (njDecode()).
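
    A minimal sketch of that harness, assuming NanoJPEG’s published API (njInit()/njDecode()/njGetWidth()/njGetHeight()/njDone()); the real nanojpeg-static.c may differ in its details:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include "nanojpeg.c"  /* single-file decoder, included directly */

    int main(void)
    {
        struct timeval start, end;
        long usec, size;
        void *buf;
        FILE *f = fopen("bbb-1080p-title.jpg", "rb");

        if (!f)
            return 1;
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fseek(f, 0, SEEK_SET);
        buf = malloc(size);
        if (fread(buf, 1, size, f) != (size_t)size)
            return 1;
        fclose(f);

        njInit();
        gettimeofday(&start, NULL);          /* time only the decode */
        if (njDecode(buf, size) != NJ_OK)
            return 1;
        gettimeofday(&end, NULL);
        usec = (end.tv_sec - start.tv_sec) * 1000000L +
               (end.tv_usec - start.tv_usec);
        printf("njDecode() took %ld us for a %dx%d image\n",
               usec, njGetWidth(), njGetHeight());
        njDone();
        free(buf);
        return 0;
    }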

    Compiling with gcc and profiling execution:

    gcc -Wall nanojpeg-static.c -o nanojpeg-static
    ./nanojpeg-static
    

    Optimization levels such as -O0, -O3, or -Os can be applied to the compilation command.

    For JavaScript conversion, I installed Emscripten and converted using:

    /path/to/emscripten/emcc nanojpeg-static.c -o nanojpeg.html \
      --preload-file bbb-1080p-title.jpg -s TOTAL_MEMORY=32000000
    

The ‘--preload-file’ option makes the file available to the program via standard C-style file I/O functions. The ‘-s TOTAL_MEMORY’ option was necessary because the default of 16 MB wasn’t enough. Again, the -O optimization levels can be passed in.

    For running, the .html file is loaded (via webserver) in a web browser.

    Want To Try It Yourself?
I put the files here: http://multimedia.cx/emscripten/. You will find the .c file, the JPEG file, and the Emscripten-converted files built with -O0, -O1, -O2, -O3, -Os, and no optimization switch.

    Results and Charts
    Here is the spreadsheet with the raw results.

    I ran this experiment using Ubuntu Linux 12.04 on an Intel Atom N450-based netbook. For this part, I was able to compare the Chrome and Firefox browser results against the C results:



    These are the results for a 2nd generation Android Nexus 7 using both Chrome and Firefox:



    Here is the result for an iPad 2 running iOS 7 and Safari– there is no Firefox for iOS and while there is a version of Chrome for iOS, it apparently isn’t able to leverage an optimized JS engine. Chrome takes so long to complete this experiment that there’s no reason to muddy the graph with the results:



It’s interesting that -O1 tends to produce faster code than levels 2 or 3, and that -Os (optimize for size) seems to be a good all-around choice.

    Don’t Get Too Smug
JavaScript can indeed get amazing performance in this day and age. Please be advised, however, that this isn’t the best that a C decoder implementation can possibly do: this version doesn’t leverage any SIMD extensions. According to profiling (gprof against the C code), sample saturation in the color conversion dominates, followed by the inverse DCT functions; both are common candidates for SIMD assembly or intrinsics. Allegedly, there will be some support for JS SIMD optimizations some day. We’ll see.

    Implications For Development
I’m still not especially motivated to try porting the entire Native Client game music player codebase to JavaScript. I’m still wondering about the recommended development flow: how are you supposed to develop for Emscripten and asm.js? From what I can tell, Emscripten is not designed as a simple aid for porting C/C++ code to JS; rather, it reduces the code to JS that you can’t possibly maintain. This seems to imply that the C/C++ code needs to be developed and debugged in its entirety and then converted to JS, which seems arduous.

  • Long Overdue MediaWiki Upgrade

February 5, 2014, by Multimedia Mike, in General

What do I do? What do I do? This library book is 42 years overdue!
    I admit that it’s mine, yet I can’t pay the fine,
    Should I turn it in or should I hide it again?
    What do I do? What do I do?

I internalized the foregoing paean to the perils of procrastination by Shel Silverstein in my formative years. It’s probably why I’ve never paid a single cent in late fees in my entire life.

However, I have been woefully negligent as the steward of the MediaWiki software that drives the world-famous MultimediaWiki, the internet’s central repository of obscure technical knowledge related to multimedia. It is currently running version 1.6 of the software; the latest version is 1.22.

    The Story So Far
    According to my records, I first set up the wiki late in 2005. I don’t know which MediaWiki release I was using at the time. I probably conducted a few upgrades in the early days, but that went by the wayside perhaps in 2007. My web host stopped allowing shell access and the MediaWiki upgrade process pretty much requires running a PHP script from a command line. Upgrade time came around and I put off the project. Weeks turned into months turned into years until, according to some notes, the wiki abruptly stopped working in July, 2011. Suddenly, there were PHP errors about “Namespace” being a reserved word.

    While I finally laid out a plan to upgrade the wiki after all these years, I eventually found that the problem had been caused when my webhost upgraded from PHP 5.2 -> 5.3. I also learned of a small number of code changes that caused the problem to go away, thus kicking the can down the road once more.

    Then a new problem showed up last week. I think it might be related to a new version of PHP again. This time, a few other things on my site broke, and I learned that my webhost now allows me to select a PHP version to use (with the version then set to “auto”, which didn’t yield much information). Rolling back to an earlier version of PHP might have solved the problem easily.

    But NO! I made the determination that this goes no further. I want this wiki upgraded.

    The Arduous Upgrade Path
    There are 2 general upgrade paths I can think of:

    1. Upgrade in place on the server
    2. Upgrade offline and put the site back on the server

    Approach #1 is problematic since I don’t have direct shell access, though I considered using something like PHP Shell. Approach #2 involves getting the entire set of wiki files and a backup of the MySQL tables. This is workable since I keep automated backups of these items anyway.

In fairly short order, I was able to set up a working copy of the MultimediaWiki hosted on a local Linux machine. Now what’s the move? The MediaWiki software I’m running is 1.6.10; the very latest, as of this upgrade project, is 1.22.2. I suppose it’s way too much to hope that the software will upgrade cleanly from 1.6.x straight to 1.22.x, but I guess it’s worth a shot…

HA! No chance. Okay, next idea: march through the various versions and upgrade each in turn. MediaWiki keeps all of its historic releases online, all the way back to the 1.3 lineage. I reasoned that the latest release of each lineage should upgrade cleanly from anything in the previous lineage, e.g., 1.6.10 should upgrade cleanly to 1.7.3 (last in the 1.7 series). This seemed like a workable strategy. So I downloaded the latest of each series, unpacked it, copied the wiki files over the working installation, and ran ‘php update.php’ in the maintenance/ directory.

    The process is tedious and not without its obstacles. I consider this penance for my years of wiki neglect. First, I run into the “PHP Parse error: syntax error, unexpected T_NAMESPACE, expecting T_STRING” issue, the same that I saw years ago after the webhost transitioned from PHP 5.2 -> 5.3. I could solve this by editing assorted files and changing “Namespace” -> “MWNamespace” (which is what MediaWiki did by version 1.13). But I would prefer not to.

    Instead, I downloaded the source for PHP 5.2 and compiled it in a separate directory, then called ‘/path/to/php/5.2/bin/php update.php’. Problem solved.

The next problem is that a bunch of the database update scripts specify “Type=InnoDB”, which isn’t supported by modern MySQL databases; the syntax is now “Engine=InnoDB”. A quick search & replace at the command line fixed this for 1.6.x… and 1.7.x… and 1.8 through 1.12. Finally, at 1.13, it was no longer necessary. As a bonus, at 1.13, I was able to test the installation since Namespace had been renamed to MWNamespace. I would later learn that the table type modifications probably could have been avoided by changing “$wgDBmysql4 = true;” to “$wgDBmysql5 = true;” somewhere in LocalSettings.php.

Command line upgrading worked smoothly up through the 1.18 series, when I hit a new fatal error:


    PHP Fatal error: Call to a member function addMessages() on a non-object in /mnt/sdb1/archive/wiki/extensions/Cite.php on line 68

    Best I could do was comment out that line. I hope that doesn’t break anything important.

    In the home stretch, the very last transition (1.21 -> 1.22) failed:

    PHP Fatal error:  Cannot redeclare wfProfileIn() (previously declared in 
    /mnt/sdb1/archive/wiki/includes/profiler/Profiler.php:33) in 
    /mnt/sdb1/archive/wiki/includes/ProfilerStub.php on line 25
    

Apparently, this problem has arisen occasionally since 1.18. I found a way around it thanks to this page: delete the file StartProfiler.php. Who am I to argue?

Upon completing the transition to 1.22, the wiki didn’t look correct– the pictures weren’t showing up. The solution was to fix the temporary directory via LocalSettings.php.

    Back To Production
Okay, it all works again! Locally, that is. How to get it back to the server? My first idea was, knowing that the upgrade process can succeed, to step through it again but tell the update.php scripts to access the database tables on multimedia.cx. This seemed to be working for a while, even though the database update phase often took 4-5 minutes. However, the transition from 1.8.5 -> 1.9.6 took 75 minutes and then timed out. According to my notes, “This isn’t going to work.”

    The new process:

    1. Dump the database tables from the local database.
    2. Create a new database remotely (melanson_wiki_ng).
3. Load the dumped tables into melanson_wiki_ng.
    4. Move the index.php file out of the wiki files directory temporarily (or rename).
    5. Modify the LocalSettings.php to talk to the new database.
    6. Perform a lftp mirror operation in order to send all the files up to the server.
    7. Send the index.php file and hope beyond hope that everything magically works.

And that’s the story of how the updated MultimediaWiki came back online. Despite the database dump file being over 110 MB, it only took MySQL 1m45s to transmit it all to the remote server (let’s hear it for the ‘--compress’ option). For comparison, inserting the tables back into a fresh local database took 1m07s.

    When the MultimediaWiki was first live again, it loaded, but ever so slowly. This is when I finally looked into optimization and found that I was lacking any caching. So as a bonus, the MultimediaWiki should be much faster now.

    Going Forward
    For all I know, I did everything described here in the hardest way possible. But at least I got it done. Unless I learn of a better process, future upgrades will probably look similar to this.

    Additionally, I should probably take some time to figure out what new features are part of the standard MediaWiki distribution nowadays.

  • Chrome’s New Audio Notifier

January 30, 2014, by Multimedia Mike, in General

    Version 32 of Google’s Chrome web browser introduced this nifty feature:


    Chrome audio notifier icon

When a browser tab has an element that is producing audio, the tab shows the above audio notification icon to inform the user. People have a few questions about this, specifically:

    1. How does this feature work?
    2. Why wasn’t this done sooner?
    3. Are other browsers going to follow suit?

    Short answers: 1) Chrome offers a new plugin API that the Flash Player is now using, as are Chrome’s internal media playing facilities; 2) this feature was contingent on the new plugin infrastructure mentioned in the previous answer; 3) other browsers would require the same infrastructure support.

    Longer answers follow…

    Plugin History
Plugins were originally based on the Netscape Plugin API (NPAPI), developed in the mid-1990s to support embedding PDFs into the Netscape web browser. The NPAPI provides graphics contexts for drawing, handles input processing, and mediates network requests through the browser’s network facilities.

    What NPAPI doesn’t do is handle audio. In the early-mid 1990s, audio support was not a widespread consideration in the consumer PC arena. Due to the lack of audio API support, if a plugin wanted to play audio, it had to go outside of the plugin framework.


    NPAPI plugin model

There are a few downsides to this approach, the most significant being that, since playback happens entirely outside the plugin framework, the browser has no knowledge that a plugin is producing audio at all.

That last point hopefully answers the question of why it has been so difficult for NPAPI-supporting browsers to implement what seems like simple functionality, such as a per-tab audio notifier.

    Plugin Future
Since Google released Chrome in an effort to facilitate advancements on the client side of the internet, they have made numerous efforts to modernize various legacy aspects of web technology. These efforts include the SPDY protocol, Native Client, WebM/WebP, and something called the Pepper Plugin API (PPAPI). This is a more modern take on the classic plugin architecture, designed to supplant the aging NPAPI:


    PPAPI plugin model

    Right away, we see that the job of the plugin writer is greatly simplified. Where was this API years ago when I was writing my API jungle piece?

    The Linux version of Chrome was apparently the first version that packaged the Pepper version of the Flash Player (doing so fixed an obnoxious bug in the Linux Flash Player interaction with GTK). Now, it looks like Windows and Mac have followed suit. Digging into the Chrome directory on a Windows 7 installation:

    AppData\Local\Google\Chrome\Application\[version]\PepperFlash\pepflashplayer.dll

    This directory exists for version 31 as well, which is still hanging around my system.

So, to reiterate: Chrome has a new plugin API, and plugins (as well as Chrome’s internal media facilities) access audio through it. Chrome knows when that API is in use, which allows the browser to display the audio notifier on a tab.

    Other Browsers
    What about other browsers? “Mozilla is not interested in or working on Pepper at this time. See the Chrome Pepper pages.”

  • Overthinking My Search Engine Problem

December 31, 2013, by Multimedia Mike, in General

    I wrote a search engine for my Game Music Appreciation website, because the site would have been significantly less valuable without it (and I would eventually realize that the search feature is probably the most valuable part of this endeavor). I came up with a search solution that was a bit sketchy, but worked… until it didn’t. I thought of a fix but still searched for more robust and modern solutions (where ‘modern’ is defined as something that doesn’t require compiling a C program into a static CGI script and hoping that it works on a server I can’t debug on).

    Finally, I realized that I was overthinking the problem– did you know that a bunch of relational database management systems (RDBMSs) support full text search (FTS)? Okay, maybe you did, but I didn’t know this.

    Problem Statement
My goal is to enable users to search the metadata (title, composer, copyright, other tags) attached to various games. To do this, I want to index a series of contrived documents that describe the metadata. Here are 2 examples of these contrived documents; they are interesting because both games have very different titles depending on region, something the search engine needs to account for:

    system: Nintendo NES
    game: Snoopy's Silly Sports Spectacular
    author: None; copyright: 1988 Kemco; dumped by: None
    additional tags: Donald Duck.nsf Donald Duck
    
    system: Super Nintendo
    game: Arcana
    author: Jun Ishikawa, Hirokazu Ando; copyright: 1992 HAL Laboratory; dumped by: Datschge
    additional tags: card.rsn.gamemusic Card Master Cardmaster
    

    The index needs to map these documents to various pieces of game music and the search solution needs to efficiently search these documents and find the various game music entries that match a user’s request.

    Now that I’ve been looking at it for long enough, I’m able to express the problem surprisingly succinctly. If I had understood that much originally, this probably would have been simpler.

    First Solution & Breakage
    My original solution was based on SWISH-E. The CGI script was a C program that statically linked the SWISH-E library into a binary that miraculously ran on my web provider. At least, it ran until it decided to stop working a month ago when I added a new feature unrelated to search. It was a very bizarre problem, the details of which would probably bore you to tears. But if you care, the details are all there in the Stack Overflow question I asked on the matter.

    While no one could think of a direct answer to the problem, I eventually thought of a roundabout fix. The problem seemed to pertain to the static linking. Since I couldn’t count on the relevant SWISH-E library to be on my host’s system, I uploaded the shared library to the same directory as the CGI script and used dlopen()/dlsym() to fetch the functions I needed. It worked again, but I didn’t know for how long.

    Searching For A Hosted Solution
I know that anything is possible in this day and age; even though my web host is fairly limited, there are lots of solutions for things like this, and you can deploy any technology you want at reasonable prices. I figured that there must be a hosted solution out there.

    I have long wanted a compelling reason to really dive into Amazon Web Services (AWS) and this sounded like a good opportunity. After all, my script works well enough; if I could just find a simple Linux box out there where I could install the SWISH-E library and compile the CGI script, I should be good to go. AWS has a free tier and I started investigating this approach. But it seems like a rabbit hole with a lot of moving pieces necessary for such a simple task.

    I had heard that AWS had something in this area. Sure enough, it’s called CloudSearch. However, I’m somewhat discouraged by the fact that it would cost me around $75 per month to run the smallest type of search instance which is at the core of the service.

    Finally, I came to another platform called Heroku. It’s supposed to be super-scalable while having a free tier for hobbyists. I started investigating FTS on Heroku and found this article which recommends using the FTS capabilities of their standard hosted PostgreSQL solution. However, the free tier of Postgres hosting only allows for 10,000 rows of data. Right now, my database has about 5400 rows. I expect it to easily overflow the 10,000 limit as soon as I incorporate the C64 SID music corpus.

    However, this Postgres approach planted a seed.

    RDBMS Revelation
    I have 2 RDBMSs available on my hosting plan– MySQL and SQLite (the former is a separate service while SQLite is built into PHP). I quickly learned that both have FTS capabilities. Since I like using SQLite so much, I elected to leverage its FTS functionality. And it’s just this simple:

    CREATE VIRTUAL TABLE gamemusic_metadata_fts USING fts3
    ( content TEXT, game_id INT, title TEXT );
    
SELECT game_id, title FROM gamemusic_metadata_fts WHERE content MATCH 'arcana';
    479|Arcana
    

The ‘content’ column gets the metadata pseudo-documents. The SQL gets wrapped up in a little PHP that queries this small database and turns the result into JSON. That script is then a drop-in replacement for the previous CGI script.
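
    The production wrapper is PHP, but for illustration, here is the same FTS query through SQLite’s C API; the database filename and the hardcoded search term are made up for the example:

    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        sqlite3_stmt *stmt;
        const char *sql = "SELECT game_id, title FROM gamemusic_metadata_fts "
                          "WHERE content MATCH ?";

        if (sqlite3_open("gamemusic.sqlite3", &db) != SQLITE_OK)
            return 1;
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
            return 1;
        /* bind the user's search term; parameter indexes are 1-based */
        sqlite3_bind_text(stmt, 1, "arcana", -1, SQLITE_STATIC);
        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%d|%s\n", sqlite3_column_int(stmt, 0),
                   (const char *)sqlite3_column_text(stmt, 1));
        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }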