Pandemic Library Tech: Converting YouTube Captions to Facebook Captions

Our library’s front-facing services and programs have been entirely digital for two months now. One thing that has taken us entirely too long to figure out has been how to provide decent captions / transcripts for our videos.

Sure, YouTube and Facebook can auto-caption any video you live-streamed through them, but it looks to me like Facebook requires you to tell it to do that for each video (based on this) and even then the captions are… well, they’re auto-captions. YouTube’s captions are better, but again: auto-captions. Not always intelligible, and definitely not as good as what you’d get from an actual human doing the captioning. But who can afford a captioning service?

Better *still*, YouTube and Facebook use different file types if you want to just upload some manually corrected captions to them. So if you have one video in both places, you can’t just correct a file once and use it in both places. Nope. That would be too easy.
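In case you've never opened one of these files, here's a rough sketch of how the two formats differ: YouTube hands you .sbv files and Facebook wants .srt. The sample caption text and the helper name `sbvTimeToSrt()` are made up for illustration, and this only handles one well-formed caption; it's meant to show the shape of the problem, not to be a full converter.

```php
<?php
// Hypothetical sample caption, not taken from a real video.
// YouTube .sbv: start and end times on one line separated by a comma,
// a single-digit hour, and a dot before the milliseconds.
$sbvCue = "0:00:01.000,0:00:03.500\nWelcome to storytime!";

// Facebook .srt: a caption number on its own line, two-digit hours,
// a comma before the milliseconds, and an arrow between the two times.
$srtCue = "1\n00:00:01,000 --> 00:00:03,500\nWelcome to storytime!";

// sbvTimeToSrt() is a made-up helper name for this illustration.
function sbvTimeToSrt(string $time): string
{
    // Split "H:MM:SS.mmm" into its parts.
    if (!preg_match('/^(\d+):(\d{2}):(\d{2})\.(\d{3})$/', $time, $m)) {
        throw new InvalidArgumentException("Not an .sbv timestamp: $time");
    }
    // Pad the hour to two digits and swap the dot for a comma.
    return sprintf('%02d:%s:%s,%s', $m[1], $m[2], $m[3], $m[4]);
}

// Pull the timing line apart, convert both times, and add the caption number.
[$timing, $text] = explode("\n", $sbvCue, 2);
[$start, $end] = explode(',', $timing);
$converted = "1\n" . sbvTimeToSrt($start) . ' --> ' . sbvTimeToSrt($end) . "\n" . $text;

echo $converted, "\n";
```

So every caption needs three changes: a number on top, an arrow in place of the comma, and reshaped timestamps.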

So we’ve been trying to figure out a workflow that would let our less tech-savvy library assistants work on captions and transcripts for us. We’re making progress, but the workflow isn’t yet in a place where I want to share it. (One major consideration is: Who are we comfortable giving enough rights on our Facebook page that they can access captioning tools? Facebook page management is not for the technologically squeamish or the person who will fail to notice when they’re acting as the page vs. when they’re acting as themself.)

What I want to share is some quick and dirty regex for converting YouTube caption files (.sbv files) to Facebook caption files (.srt). Is this the most code-efficient way of converting between the two file types? Absolutely not. Have I built in much in the way of checks to see whether the original file was well formatted? Nope. This is, I emphasize, quick and dirty. Also it’s the first time I’ve used PHP in… gosh, several years, I think. Probably the past-me who was fluent and spent all kinds of time coding would cringe.

That said, I didn’t find example regex when I was looking, so… yeah. Anyway, without further ado:

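    //turn all the non-html new line characters into <br> so they'll display on this page
    //(match each line ending separately, so a blank line always becomes <br><br>,
    //whether the file uses \r\n or plain \n)
    $pattern = "/\r\n|\r|\n/";
    $replacement = "<br>";
    $convertMe = preg_replace($pattern,$replacement,$convertMe);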
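    //turn the commas between the start and end times into arrows
    //(only a comma sitting between milliseconds and the next timestamp,
    //so a number like 1,000 in the caption text is left alone)
    $pattern = "/(\d{3}),(\d+:\d{2}:)/";
    $replacement = "$1 --> $2";
    $convertMe = preg_replace($pattern,$replacement,$convertMe);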
    //put the leading zero on the hour where it's needed, and comma-ify the milliseconds
    //(assumes a single-digit hour, as in 0:00:01.000)
    $pattern = "/([0-9]{1}:)([0-9]{2}:[0-9]{2})\.([0-9]{3})/";
    $replacement = "0$1$2,$3";
    $convertMe = preg_replace($pattern,$replacement,$convertMe);
    //now number the captions for facebook / .srt
    //count how many <br><br> there are, so we know how many captions
    $howMany = substr_count($convertMe, "<br><br>") + 1;
    //the first caption is its own case
    $convertMe = "1<br>" . $convertMe;
    //number all the rest of the captions, one at a time:
    //each pass through the loop numbers the first <br><br> that isn't numbered yet
    $pattern = "/<br><br>(\d\d:)/";
    for($i = 2; $i <= $howMany; $i++){
        $replacement = "<br><br>" . $i . "<br>$1";
        $convertMe = preg_replace($pattern, $replacement, $convertMe, 1);