How does text recognition from images work?

We use a process called Optical Character Recognition (OCR) that scans an image, searches for text in it, and then converts that text into a machine-readable format before returning it.

We can break down OCR into five different steps.

Step 1: Image Acquisition

The image is scanned and each of its pixels is converted to either black or white, a process known as binarization that makes the text stand out from the background.
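As an illustration, here is a minimal sketch of that black-and-white conversion using Core Image's colorThreshold filter (available on iOS 14 and later); the 0.5 threshold is an arbitrary starting point you would tune for your images:

import CoreImage
import CoreImage.CIFilterBuiltins

// A minimal binarization sketch: every pixel brighter than the
// threshold becomes white, everything else becomes black.
func binarize(_ cgImage: CGImage, threshold: Float = 0.5) -> CGImage? {
    let filter = CIFilter.colorThreshold()
    filter.inputImage = CIImage(cgImage: cgImage)
    filter.threshold = threshold  // 0.5 is an assumption; tune per image
    guard let output = filter.outputImage else { return nil }
    return CIContext().createCGImage(output, from: output.extent)
}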

Step 2: Pre-processing

Areas outside of the text are then removed from the image, so later stages only have to work on the regions that actually contain text.
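As a sketch, cropping the scan down to just the text region can be a single CGImage crop; the rectangle here is a placeholder that would normally come from a text-detection pass or user input:

import CoreGraphics

// A minimal pre-processing sketch: discard everything outside the
// region believed to contain text.
func cropToTextRegion(_ cgImage: CGImage, textRegion: CGRect) -> CGImage? {
    cgImage.cropping(to: textRegion)  // returns nil if the rect lies outside the image
}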

Step 3: Segmentation

The image is segmented so that individual lines, words, and characters can be isolated and analyzed separately.
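Vision exposes this stage directly: VNDetectTextRectanglesRequest reports the bounding box of each piece of text and, optionally, of each character inside it. A minimal sketch:

import Vision

// A minimal segmentation sketch: find text regions and the individual
// character boxes inside them (coordinates are normalized, 0–1).
func segmentText(in cgImage: CGImage) {
    let request = VNDetectTextRectanglesRequest { req, _ in
        guard let observations = req.results as? [VNTextObservation] else { return }
        for text in observations {
            print("text box:", text.boundingBox)
            for character in text.characterBoxes ?? [] {
                print("  character box:", character.boundingBox)
            }
        }
    }
    request.reportCharacterBoxes = true  // segment all the way down to characters
    try? VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}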

Step 4: Feature Extraction

Every character is then recognized and converted to machine-readable text. The OCR engine keeps a library of fonts and glyph patterns to compare the scanned shapes against. There are several approaches to this stage.

The fast approach, also known as the traditional method, scans one character at a time and compares it against stored patterns. This approach uses a small machine learning model to recognize each character.

The accurate approach scans the text line by line and then converts it, much like how a human reads blocks of text. This approach uses a neural network to identify lines and then analyzes them further to find words and sentences.
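These two approaches map directly onto Vision's recognitionLevel setting, so switching between them is a one-line change:

let request = VNRecognizeTextRequest()
// Pick one, depending on the speed/accuracy trade-off you need:
request.recognitionLevel = .fast        // character-by-character matching: quicker, less accurate
// request.recognitionLevel = .accurate // line-by-line neural network: slower, more accurate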

Step 5: Post-processing

This step checks whether the system has correctly recognized all of the text in the image. If it has not, the mistakes can be corrected manually.
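Vision can automate part of this check: usesLanguageCorrection runs the results through a language model, and every candidate string carries a confidence score you can use to flag lines for manual review. A minimal sketch (the 0.5 cutoff is an arbitrary assumption):

let request = VNRecognizeTextRequest { req, _ in
    guard let results = req.results as? [VNRecognizedTextObservation] else { return }
    for observation in results {
        guard let candidate = observation.topCandidates(1).first else { continue }
        if candidate.confidence < 0.5 {  // arbitrary cutoff; tune for your use case
            print("Low confidence, review manually:", candidate.string)
        }
    }
}
request.usesLanguageCorrection = true  // let a language model fix likely misreads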

Example Code:

//  ContentView.swift

import SwiftUI
import Vision

struct ContentView: View {
    @State private var image: UIImage?
    @State private var recognizedText = ""
    @State private var showPicker = false
    @State private var useCamera = false

    var body: some View {
        NavigationView {
            VStack(spacing: 20) {
                if let img = image {
                    Image(uiImage: img)
                        .resizable()
                        .scaledToFit()
                        .frame(maxHeight: 300)
                        .cornerRadius(8)
                } else {
                    Rectangle()
                        .fill(Color.secondary.opacity(0.2))
                        .frame(height: 300)
                        .overlay(Text("No Image Selected").foregroundColor(.secondary))
                        .cornerRadius(8)
                }

                Button("Select Photo") {
                    useCamera = false
                    showPicker = true
                }
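                // NOTE: the camera option requires a real device and an
                // NSCameraUsageDescription entry in the app's Info.plist.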
                Button("Take Photo") {
                    useCamera = true
                    showPicker = true
                }

                ScrollView {
                    Text(recognizedText)
                        .padding()
                        .frame(maxWidth: .infinity, alignment: .leading)
                        .background(Color.secondary.opacity(0.1))
                        .cornerRadius(8)
                }
                .frame(maxHeight: 200)

                Spacer()
            }
            .padding()
            .navigationTitle("Secret Decoder")
            .sheet(isPresented: $showPicker) {
                ImagePicker(sourceType: useCamera ? .camera : .library) { img in
                    self.image = img
                    recognizeText(in: img)
                }
            }
        }
    }

    func recognizeText(in image: UIImage) {
        recognizedText = "Decoding…"
        guard let cgImage = image.cgImage else {
            recognizedText = "Could not read image."
            return
        }

        let request = VNRecognizeTextRequest { req, err in
            guard let results = req.results as? [VNRecognizedTextObservation],
                  err == nil else {
                DispatchQueue.main.async { recognizedText = "Recognition failed." }
                return
            }
            let lines = results.compactMap { obs in
                obs.topCandidates(1).first?.string
            }
            DispatchQueue.main.async {
                recognizedText = lines.joined(separator: "\n")
            }
        }
        request.recognitionLevel = .accurate
        // request.recognitionLanguages = ["en-US", "es"]  // optional: restrict recognition to specific languages

        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        // Run recognition off the main thread; report any failure to the UI.
        DispatchQueue.global(qos: .userInitiated).async {
            do {
                try handler.perform([request])
            } catch {
                DispatchQueue.main.async { recognizedText = "Recognition failed: \(error.localizedDescription)" }
            }
        }
    }
}

// MARK: - ImagePicker
struct ImagePicker: UIViewControllerRepresentable {
    enum Source { case camera, library }
    var sourceType: UIImagePickerController.SourceType
    var completion: (UIImage) -> Void

    init(sourceType: Source, completion: @escaping (UIImage) -> Void) {
        self.sourceType = sourceType == .camera ? .camera : .photoLibrary
        self.completion = completion
    }

    func makeCoordinator() -> Coordinator { Coordinator(self) }

    func makeUIViewController(context: Context) -> UIImagePickerController {
        let picker = UIImagePickerController()
        picker.delegate = context.coordinator
        picker.sourceType = sourceType
        return picker
    }

    func updateUIViewController(_ uiVC: UIImagePickerController, context: Context) {}

    class Coordinator: NSObject, UINavigationControllerDelegate, UIImagePickerControllerDelegate {
        let parent: ImagePicker
        init(_ parent: ImagePicker) { self.parent = parent }

        func imagePickerController(_ picker: UIImagePickerController,
            didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey : Any]) {
            picker.dismiss(animated: true)
            if let img = info[.originalImage] as? UIImage {
                parent.completion(img)
            }
        }
        func imagePickerControllerDidCancel(_ picker: UIImagePickerController) {
            picker.dismiss(animated: true)
        }
    }
}
#Preview {
    ContentView()
}

Check out the GitHub repository for the README and a walkthrough of how to run the code above:

https://github.com/Swift-Coding-Club-UCD/SneakyPeaky