Using whisper.el to convert speech to text and save it to the currently clocked task in Org Mode or elsewhere
| emacs, audio, speech
I want to get my thoughts into the computer quickly, and talking might be a good way to do some of that. OpenAI Whisper is reasonably good at recognizing my speech now, and whisper.el gives me a convenient way to call whisper.cpp from Emacs with a single keybinding. Here is how I have it set up for reasonable performance on my Lenovo P52 with just the CPU, no GPU.
I've bound <f9> to the command whisper-run. I press <f9> to start recording, talk, and then press <f9> to stop recording. By default, it inserts the text into the buffer at the current point. I've set whisper-return-cursor-to-start to nil so that I can keep going.
(use-package whisper
  :vc "https://github.com/natrys/whisper.el"
  :load-path "~/vendor/whisper.el"
  :config
  (setq whisper-quantize "q4_0")
  (setq whisper-install-directory "~/vendor")
  ;; Get it running with whisper-server-mode set to nil first before you switch to 'local.
  ;; If you change models, evaluate:
  ;; (whisper-install-whispercpp (whisper--check-install-and-run nil "whisper-start"))
  (setq whisper-server-mode 'local)
  (setq whisper-model "base")
  (setq whisper-return-cursor-to-start nil)
  ;; Input device: the last uncommented setting wins; the others are kept for switching.
  (setq whisper--ffmpeg-input-device "alsa_input.usb-Blue_Microphones_Yeti_Stereo_Microphone_REV8-00.analog-stereo")
  ;; (setq whisper--ffmpeg-input-device "VirtualMicSink.monitor")
  (setq whisper--ffmpeg-input-device "audiorelay-virtual-mic-sink:monitor_FL")
  (setq whisper-language "en")
  (setq whisper-before-transcription-hook nil)
  ;; Leave one core free for everything else
  (setq whisper-use-threads (1- (num-processors)))
  (setq whisper-transcription-buffer-name-function 'whisper--simple-transcription-buffer-name)
  (add-hook 'whisper-after-transcription-hook 'my-subed-fix-common-errors-from-start)
  :bind
  (("<f9>" . whisper-run)
   ("C-<f9>" . my-whisper-org-capture-to-clock)
   ("S-<f9>" . my-whisper-replay)
   ("M-<f9>" . my-whisper-toggle-language)))
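The device names above look like PulseAudio/PipeWire source names. Here is a minimal sketch for listing the available ones from Emacs, assuming the pactl command is installed. Both function names are hypothetical helpers of mine and are not part of whisper.el.

```elisp
;; Hypothetical helpers; not part of whisper.el.
(defun my-whisper--parse-pactl-sources (output)
  "Return the source names found in OUTPUT of `pactl list short sources'.
Each line is tab-separated, with the source name in the second field."
  (mapcar (lambda (line) (nth 1 (split-string line "\t")))
          (split-string output "\n" t)))

(defun my-whisper-list-audio-sources ()
  "Show the audio source names reported by pactl.
Assumes the pactl command is available on this system."
  (interactive)
  (message "%s"
           (mapconcat #'identity
                      (my-whisper--parse-pactl-sources
                       (shell-command-to-string "pactl list short sources"))
                      "\n")))
```

After M-x my-whisper-list-audio-sources, one of the listed names can be copied into whisper--ffmpeg-input-device.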
The technology isn't quite there yet to do real-time audio transcription so that I can see what it understands while I'm saying things, but that might be distracting anyway. If I do it in short segments, it might still be okay. I can replay the most recently recorded snippet in case it's missed something and I've forgotten what I just said.
(defun my-whisper-replay ()
  "Replay the last temporary recording."
  (interactive)
  ;; mpv-play comes from the mpv.el package
  (mpv-play whisper--temp-file))
It can also understand French.
(defun my-whisper-toggle-language ()
  "Toggle `whisper-language' between English and French.
Setting the language explicitly helps because auto-detection sometimes
doesn't figure out the right one."
  (interactive)
  (setq whisper-language (if (string= whisper-language "en") "fr" "en"))
  ;; If using a server, we need to restart it for the new language to take effect
  (when (process-live-p whisper--server-process) (kill-process whisper--server-process))
  (message "%s" whisper-language))
I could use this with org-capture, but that's a lot of keystrokes. My shortcut for org-capture is C-c r. After that, I need to press at least one key to select the template, <f9> to start recording, <f9> to stop recording, and C-c C-c to save it. I want to be able to capture notes to my currently clocked-in task without having an Org capture buffer interrupt my display.
To clock in, I can use C-c C-x i or my ! speed command. Bonus: the modeline displays the current task to keep me on track, and I can use org-clock-goto (which I've bound to C-c j) to jump to it.
Then, when I'm looking at something else and I want to record a note, I can press <f9> to start the recording, and then C-<f9> to stop it and save the text to my currently clocked task along with a link to whatever I'm looking at.
(defvar my-whisper-org-target nil
  "Where to save the transcription.
Nil means jump to the current clocked-in entry and insert it along with
a link, or prompt for a capture template if nothing is clocked in.

If this is set to a string, it should specify a key from
`org-capture-templates'. The text will be in %i, and you can use %a for the link.
For example, you could have a template entry like this:
\(\"c\" \"Contents to current clocked task\" plain (clock) \"%i%?\\n%a\" :empty-lines 1)

If this is set to a function, the function will be called from the
original marker with the text as the argument. Note that the window
configuration and message will not be preserved after this function is
run, so if you want to change the window configuration or display a
message, add a timer.")
(defun my-whisper-org-capture-to-clock ()
  "Start or stop recording, then save the text with `my-whisper-org-save'."
  (interactive)
  (require 'whisper)
  (add-hook 'whisper-after-transcription-hook #'my-whisper-org-save 50)
  (whisper-run))
(defun my-whisper-org-save ()
  "Save the transcription according to `my-whisper-org-target'.
This runs from `whisper-after-transcription-hook', with the
transcription in the current buffer."
  (let ((text (string-trim (buffer-string))))
    ;; One-shot: remove ourselves so that plain <f9> goes back to inserting at point
    (remove-hook 'whisper-after-transcription-hook #'my-whisper-org-save)
    (erase-buffer) ; stops further processing
    (save-window-excursion
      (with-current-buffer (marker-buffer whisper--marker)
        (goto-char whisper--marker)
        (cond
         ((functionp my-whisper-org-target)
          (funcall my-whisper-org-target text))
         (my-whisper-org-target
          (setq org-capture-initial text)
          (org-capture nil my-whisper-org-target)
          (org-capture-finalize)
          ;; Delay the display of the message because whisper--cleanup-transcription clears it
          (run-at-time 0.5 nil (lambda (text) (message "Captured: %s" text)) text))
         ((org-clocking-p)
          (let ((link (org-store-link nil)))
            (org-clock-goto)
            (org-end-of-subtree)
            (unless (bolp)
              (insert "\n"))
            (insert "\n" text "\n" link "\n"))
          (run-at-time 0.5 nil (lambda (text) (message "Added clock note: %s" text)) text))
         (t
          (kill-new text)
          (setq org-capture-initial text)
          (call-interactively 'org-capture)
          ;; Delay restoring the window configuration, since it is not
          ;; preserved after this function runs
          (let ((config (current-window-configuration)))
            (run-at-time 0.5 nil
                         (lambda (text config)
                           (set-window-configuration config)
                           (message "Copied: %s" text))
                         text config))))))))
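To go through a capture template instead, something like the following might work. It is a minimal sketch that reuses the example template from the docstring above; the "c" key is only an illustration and assumes that key isn't already taken in your org-capture-templates.

```elisp
;; Register the example template once org-capture is loaded.
(with-eval-after-load 'org-capture
  (add-to-list 'org-capture-templates
               '("c" "Contents to current clocked task" plain (clock)
                 "%i%?\n%a" :empty-lines 1)))
;; Tell my-whisper-org-save to use that template key.
(setq my-whisper-org-target "c")
```

With this set, C-<f9> files the text under the clocked-in task through org-capture and finalizes it right away, so the capture buffer doesn't stay on screen.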
Here's an idea for a my-whisper-org-target function that saves the recognized text with a timestamp.
(defvar my-whisper-notes "~/sync/stream/narration.org"
  "File where `my-whisper-save-to-file' appends transcriptions.")
(defun my-whisper-save-to-file (text)
  "Append TEXT to `my-whisper-notes' with a timestamp and a link."
  (let ((link (org-store-link nil)))
    (with-current-buffer (find-file-noselect my-whisper-notes)
      (goto-char (point-max))
      (insert "\n\n" (format-time-string "%H:%M ") text "\n" link "\n")
      (save-buffer)
      (run-at-time 0.5 nil (lambda (text) (message "Saved to file: %s" text)) text))))
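To try it, point my-whisper-org-target at the function. A minimal sketch, assuming the definitions above have already been evaluated:

```elisp
;; C-<f9> will now append transcriptions to the file in my-whisper-notes.
(setq my-whisper-org-target #'my-whisper-save-to-file)
```

Setting the variable back to nil restores the default of saving to the clocked-in task.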
I think I've just figured out my Pipewire setup so that I can record audio in OBS while also being able to do speech to text, without the audio stuttering. qpwgraph was super helpful for visualizing the Pipewire connections and fixing them. Actually making a demonstration video will probably need to wait for another day, though!