For those interested, a member of the forum has an incredible tool called StoryToolkitAI.
It was created to fill the need for transcriptions.
In some aspects, this tool is way ahead of the one in Resolve. But it's (obviously) less integrated than the native tool.
viewtopic.php?f=21&t=168403&hilit=storytoolkit
I use it almost every day, mostly for its ability to use the "large-v3" model, which is very, very good (and better than the one shipped with Resolve) at identifying the correct words in noisy or quiet environments.
We can use it outside of Resolve too and batch process multiple files; the SRTs are saved next to the original video files.
WhisperX (a fork of Whisper) was updated more recently, and the transcription is really fast now.
https://github.com/m-bain/whisperX
This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
- Batched inference for 70x realtime transcription using whisper large-v2
- faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
- Accurate word-level timestamps using wav2vec2 alignment
- Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
- VAD preprocessing, reduces hallucination & batching with no WER degradation
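Before getting to my scripts, the most basic invocation looks like this (a minimal example: interview.wav is just a placeholder, and it assumes WhisperX is installed in your active Python environment):
- Code: Select all
whisperx interview.wav --model large-v3 --output_format srt --language en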
If anyone is interested, I have a .bat script in the "SendTo" folder of Windows (so it appears under a right click on a file):
- Code: Select all
@echo off
setlocal enabledelayedexpansion
rem Activate the Python virtual environment.
call "C:\WhisperX\venv\Scripts\activate"
rem Prompt the user to choose the model or use default.
set /P "model=Enter model (press Enter for large-v3)"
if "!model!"=="" set "model=large-v3"
rem Process each selected file.
for %%I in (%*) do (
rem Get the file name and extension of the current selected file.
set "filename=%%~nI"
set "extension=%%~xI"
rem Run the WhisperX command with the current selected file and chosen model.
whisperx "%%~fI" --output_format srt --model !! --verbose True --fp16 True --compute_type float16 --print_progress True --batch_size 18
)
pause
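In case it helps, a quick way to open the "SendTo" folder is this standard Windows shortcut (nothing specific to my setup); drop the .bat in there and it shows up under right click > Send to:
- Code: Select all
explorer shell:sendto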
Btw, the --batch_size 18 is specific to me, since I have enough VRAM with the 3090 to support some of the options and models (especially if multiple scripts are run at the same time).
Note: I share this for those who already have Python installed (or know how to do it), have or know how to create a virtual environment with Python, and can install what's needed from GitHub (and the proper Torch, Torchvision, etc. for their hardware).
https://github.com/m-bain/whisperX
I'm not a specialist in these things, so it's pretty much me having a super basic knowledge of scripting to automate things as much as possible.
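For reference, the one-time setup is roughly this (just a sketch: the exact Torch/CUDA build depends on your hardware, so follow the WhisperX readme and the PyTorch site; C:\WhisperX is simply the path my script above assumes):
- Code: Select all
rem One-time setup, run in a normal command prompt.
python -m venv C:\WhisperX\venv
call "C:\WhisperX\venv\Scripts\activate"
rem Install a Torch build matching your GPU/CUDA first (see pytorch.org), then:
pip install whisperx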
I have two other scripts that I place manually in a folder with dozens of files to transcribe.
This one puts all the filenames with these extensions into a simple text file:
- Code: Select all
@echo off
rem List all media files (bare format, names only) into files_list.txt.
dir *.mov *.mp4 *.mp3 *.wav *.mkv *.webm /b > files_list.txt
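That produces a plain list, one file per line, something like this (names made up for illustration):
- Code: Select all
interview_part1.mov
interview_part2.mov
voiceover.wav
broll_day2.mp4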
And this one reads that list and transcribes each file, one after another.
- Code: Select all
@echo off
rem Ask for the name of the list file (e.g. files_list.txt).
set /p filename="Please enter the filename: "
rem Read the list line by line and transcribe each file.
for /F "usebackq delims=" %%F in ("%filename%") do (
    echo %%F
    whisperx "%%F" --output_format srt --output_dir subtitles --model medium.en
)
I did it this way so I could just manually split the content of files_list.txt into smaller files and run multiple instances of the .bat script. I used the medium.en model for this older script because, at the time, transcription was way slower than it is now, and the large model didn't make much of a difference in quality. When the audio is really good, medium is totally fine.
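For what it's worth, even the splitting could be scripted. Here's a rough sketch (my own quick idea, not something from the WhisperX repo) that round-robins the lines of files_list.txt into three smaller lists named files_list_part0/1/2.txt:
- Code: Select all
@echo off
setlocal enabledelayedexpansion
rem Distribute the lines of files_list.txt across three part files.
set /a n=0
for /F "usebackq delims=" %%F in ("files_list.txt") do (
    set /a n+=1
    set /a part=n %% 3
    >>"files_list_part!part!.txt" echo %%F
)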
I know there are tons of better ways to do this, but hacking it this way has already saved me hours, and it's fine for my use case. I'm sharing it because it might give ideas to people who are really good at coding.
IF for some reason anyone tries to use my stuff (good luck lol), and if it works for you: what's in the blue circle of my screenshot is "fine", I didn't have any problem with it.
I removed an argument (--align_model WAV2VEC2_ASR_LARGE_LV60K_960H). On the GitHub page they say:
For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.
- Code: Select all
whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
I kept it that way for a long time, but things have changed since then... Anyway, it's not really needed (though I still use it myself when I run WhisperX).
--output_format srt can be removed too if needed (or replaced by one of the supported formats); without it, WhisperX saves transcriptions in all the formats it supports: txt, vtt, srt, tsv, json.
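For example, leaving the flag off entirely (clip.mp4 as a placeholder, and assuming the default really is "all" formats) should drop one file per format into the output folder:
- Code: Select all
whisperx clip.mp4 --model large-v3 --output_dir subtitles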
On a 1h30 interview, it took me less than 2 minutes to transcribe with the large-v3 model on a 3090. If you add --align_model WAV2VEC2_ASR_LARGE_LV60K_960H, it's slightly longer, but not more than 1-2 minutes extra (for me). I didn't try a bigger batch size because it's fast enough already.
Very, very last thing: this is not perfect! The transcription tool in Resolve is tweaked so that captions won't run to crazy lengths. It's better suited for video editing and real captioning.
So keep this in mind, there are pros and cons. I personally use WhisperX because I can transcribe hundreds of clips outside of Resolve, the large-v3 model is way better than the one shipped with Resolve, and I don't use it for captioning but to have a text file of what's said in my video files. (I also use StoryToolkitAI, which uses Whisper under the hood and is more in line with what Resolve does.)