For those interested, a member of the forum has an incredible tool called StoryToolkitAI.
It was created to fill the need for transcriptions.
In some aspects, this tool is way ahead of the one in Resolve. But it's (obviously) less integrated than the native tool.
viewtopic.php?f=21&t=168403&hilit=storytoolkit
I use it almost every day, mostly for its ability to use the "large-v3" model, which is very, very good (and better than the one shipped with Resolve) at identifying the correct words in noisy or quiet environments.
We can use it outside of Resolve too and batch process multiple files; the SRTs are saved next to the original video files.
WhisperX (a fork of Whisper) was updated more recently, and the transcription is really fast now.
https://github.com/m-bain/whisperX
This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
- Batched inference for 70x realtime transcription using whisper large-v2
- faster-whisper backend, requires <8GB gpu memory for large-v2 with beam_size=5
- Accurate word-level timestamps using wav2vec2 alignment
- Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
- VAD preprocessing, reduces hallucination & batching with no WER degradation
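Before getting to my scripts, the most basic invocation looks like this (a minimal example: interview.wav is just a placeholder, and it assumes WhisperX is installed in your active Python environment):
- Code: Select all
whisperx interview.wav --model large-v3 --output_format srt --language en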
If anyone is interested, I have a .bat script in the "SendTo" folder of Windows (so it appears under a right click on a file):
- Code: Select all
@echo off
setlocal enabledelayedexpansion
rem Activate the Python virtual environment.
call "C:\WhisperX\venv\Scripts\activate"
rem Prompt the user to choose the model or use default.
set /P "model=Enter model (press Enter for large-v3)"
if "!model!"=="" set "model=large-v3"
rem Process each selected file.
for %%I in (%*) do (
rem Get the file name and extension of the current selected file.
set "filename=%%~nI"
set "extension=%%~xI"
rem Run the WhisperX command with the current selected file and chosen model.
whisperx "%%~fI" --output_format srt --model !! --verbose True --fp16 True --compute_type float16 --print_progress True --batch_size 18
)
pause
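In case it helps, a quick way to open the "SendTo" folder is this standard Windows shortcut (nothing specific to my setup); drop the .bat in there and it shows up under right click > Send to:
- Code: Select all
explorer shell:sendto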
Btw, the --batch_size 18 is specific to me, since I have enough VRAM with the 3090 to support some of the options and models (especially if multiple scripts are run at the same time).
Note: I share this for those who already have Python installed (or know how to do it), have or know how to create a virtual environment with Python, and can install what's needed from GitHub (and the proper Torch, Torchvision, etc. for their hardware).
https://github.com/m-bain/whisperX
I'm not a specialist in these things, so it's pretty much me having a super basic knowledge of scripting to automate things as much as possible.
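For reference, the one-time setup is roughly this (just a sketch: the exact Torch/CUDA build depends on your hardware, so follow the WhisperX readme and the PyTorch site; C:\WhisperX is simply the path my script above assumes):
- Code: Select all
rem One-time setup, run in a normal command prompt.
python -m venv C:\WhisperX\venv
call "C:\WhisperX\venv\Scripts\activate"
rem Install a Torch build matching your GPU/CUDA first (see pytorch.org), then:
pip install whisperx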
I have two other scripts that I place manually in a folder with dozens of files to transcribe.
This one puts all the filenames with these extensions into a simple text file:
- Code: Select all
@echo off
rem List all media files (bare format, names only) into files_list.txt.
dir *.mov *.mp4 *.mp3 *.wav *.mkv *.webm /b > files_list.txt
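That produces a plain list, one file per line, something like this (names made up for illustration):
- Code: Select all
interview_part1.mov
interview_part2.mov
voiceover.wav
broll_day2.mp4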
And this one reads that list and transcribes each file, one after another.
- Code: Select all
@echo off
rem Ask for the name of the list file (e.g. files_list.txt).
set /p filename="Please enter the filename: "
rem Read the list line by line and transcribe each file.
for /F "usebackq delims=" %%F in ("%filename%") do (
    echo %%F
    whisperx "%%F" --output_format srt --output_dir subtitles --model medium.en
)
I did it this way so I could just manually split the content of files_list.txt into smaller files and run multiple instances of the .bat script. I used the medium.en model for this older script because, at the time, transcription was way slower than it is now, and the large model didn't make much of a difference in quality. When the audio is really good, medium is totally fine.
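For what it's worth, even the splitting could be scripted. Here's a rough sketch (my own quick idea, not something from the WhisperX repo) that round-robins the lines of files_list.txt into three smaller lists named files_list_part0/1/2.txt:
- Code: Select all
@echo off
setlocal enabledelayedexpansion
rem Distribute the lines of files_list.txt across three part files.
set /a n=0
for /F "usebackq delims=" %%F in ("files_list.txt") do (
    set /a n+=1
    set /a part=n %% 3
    >>"files_list_part!part!.txt" echo %%F
)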
I know there are tons of better ways to do this, but hacking it this way has already saved me hours, and it's fine for my use case. I'm sharing it because it might give ideas to people who are really good at coding.
IF for some reason anyone tries to use my stuff (good luck lol), and if it works for you: what's in the blue circle of my screenshot is "fine", I didn't have any problem with it.
I removed an argument (--align_model WAV2VEC2_ASR_LARGE_LV60K_960H). On the GitHub page they say:
For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.
- Code: Select all
whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
I kept it that way for a long time, but things have changed since then... Anyway, it's not really needed (though I still use it myself when I run WhisperX).
--output_format srt can be removed too if needed (or replaced by one of the supported formats); without it, WhisperX saves transcriptions in all the formats it supports: txt, vtt, srt, tsv, json.
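For example, leaving the flag off entirely (clip.mp4 as a placeholder, and assuming the default really is "all" formats) should drop one file per format into the output folder:
- Code: Select all
whisperx clip.mp4 --model large-v3 --output_dir subtitles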
On a 1h30 interview, it took me less than 2 minutes to transcribe with the large-v3 model on a 3090. If you add --align_model WAV2VEC2_ASR_LARGE_LV60K_960H, it's slightly longer, but not more than 1-2 minutes extra (for me). I didn't try a bigger batch size because it's fast enough already.
Very, very last thing: this is not perfect! The transcription tool in Resolve is tweaked so that captions won't run to crazy lengths. It's better suited for video editing and real captioning.
So keep this in mind, there are pros and cons. I personally use WhisperX because I can transcribe hundreds of clips outside of Resolve, the large-v3 model is way better than the one shipped with Resolve, and I don't use it for captioning but to have a text file of what's said in my video files. (I also use StoryToolkitAI, which uses Whisper under the hood and is more in line with what Resolve does.)