This project was created to digitize data from scanned paper surveys. It consists of two steps: preparation and annotation. In the preparation step, simple data points such as check boxes are evaluated automatically and the data structure is created. The annotation step then provides a GUI with a side-by-side view for manual review, correction, and completion of the survey data.
The project is set up to be managed with uv (https://docs.astral.sh/uv/) and you can install the dependencies with uv sync.
It is also possible to set up without uv by manually installing the dependencies listed in pyproject.toml.
On Windows some additional setup is necessary to satisfy all dependencies. Please refer to the installation guides for https://pypi.org/project/pdf2image/ and https://pypi.org/project/pytesseract/#installation.
The project has two main Python scripts: one prepares the data for the GUI, and the other is the annotation GUI itself.
Run prepareData to convert the survey PDF into images, create metadata files and extract survey data. This extraction is needed to parse the data into a format suitable for the GUI.
The script requires at least the experiment's directory. It expects a single PDF file in this directory and, optionally, a JSON metadata file. If there are several PDFs in the folder, only the last one found is converted. Please be aware that large PDFs take quite some time to parse. If no metadata.json exists when the data is first parsed, the file is created. There is currently no automatic detection of the experiment type, so you have to add this information to metadata.json yourself. You should have enough time to do this while the PDF is being parsed. Otherwise you will hit a runtime error, but don't worry: just add the information and run the script again. The images are kept, so you won't need to wait again.
"Experiment": {
"Type": "unknown", #<- define the actual experiment type here
"Name": ""
},
The current output folder structure after parsing is shown below. Changes are possible, but please remember to change the respective experiment directories listed in the experiment's metadata.json as well.
mainFolder
|-<date>_<experiment>
| |-metadata.json
| |-originalData.pdf
| |-surveyResponse_<nr>
| | |-page0
| | |-page1
| | |-...
| | |-data.csv <-- parsed experiment proband data in here
| |
| |-surveyResponse_<nr>
| | |...
| |...
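Since metadata.json sits at the top of each experiment folder in the layout above, the experiment type can also be filled in with a few lines of Python instead of a text editor. The following is only a minimal sketch, not part of the project: the helper names `set_experiment_type` and `list_survey_folders` are hypothetical, and it assumes that "Experiment" is a top-level key of metadata.json, as the fragment above suggests.

```python
import json
from pathlib import Path


def set_experiment_type(experiment_dir, experiment_type, name=None):
    """Write the experiment type (and optionally the name) into metadata.json."""
    metadata_path = Path(experiment_dir) / "metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    # Assumption: "Experiment" is a top-level key of metadata.json.
    experiment = metadata.setdefault("Experiment", {})
    experiment["Type"] = experiment_type
    if name is not None:
        experiment["Name"] = name
    metadata_path.write_text(json.dumps(metadata, indent=4), encoding="utf-8")
    return metadata


def list_survey_folders(experiment_dir):
    """Return the surveyResponse_<nr> folders of one experiment, sorted by name."""
    return sorted(
        p for p in Path(experiment_dir).glob("surveyResponse_*") if p.is_dir()
    )
```

Calling `set_experiment_type(<experiment directory>, <experiment type>)` while the PDF is still being parsed should avoid the runtime error described above, provided the assumed key layout matches your metadata.json. Note that the folder names are sorted as strings, so `surveyResponse_10` comes before `surveyResponse_2`.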
The GUI is started from annotateData. This is just a launcher, though: the GUI's main windows can be found in guiWindows.py, and the widgets shown within those windows can be found in guiWidgets.py.
To detect single lines of printed text, use ocr() in ocr.py. For handwriting recognition, try recognize_handwriting() in htr.py. htr.py requires additional packages that are currently commented out in pyproject.toml.
The entire code is currently under the MIT License. When you include code from Stack Overflow or similar sites, make sure to check its license (usually CC BY-SA and therefore not compatible with MIT). If you add to the code and are fine with MIT, just add your copyright notice above ours in the license file and indicate the new authors in the files you changed.
I clicked "Done" by accident
Although this cannot be reversed within the GUI, you can search for the specific survey in the metadata.json file of the experiment and change the status back to "Open".
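If you prefer a script over editing the file by hand, the status reset could look roughly like the sketch below. It is not part of the project, and the layout of metadata.json is assumed: the hypothetical helper `reopen_survey` searches the nested JSON for a dictionary stored under the survey's identifier (for example its folder name) that contains a status field. The field name "Status" is an assumption as well, so check your own metadata.json and pass a different `status_key` if needed.

```python
import json
from pathlib import Path


def reopen_survey(metadata_path, survey_id, status_key="Status"):
    """Set the status of one survey back to "Open"; return True if an entry changed."""
    path = Path(metadata_path)
    metadata = json.loads(path.read_text(encoding="utf-8"))

    def visit(node):
        # Walk nested dicts and lists until a dict stored under survey_id is found.
        if isinstance(node, dict):
            entry = node.get(survey_id)
            if isinstance(entry, dict) and status_key in entry:
                entry[status_key] = "Open"
                return True
            return any(visit(child) for child in node.values())
        if isinstance(node, list):
            return any(visit(child) for child in node)
        return False

    changed = visit(metadata)
    if changed:
        # Only rewrite the file when something was actually changed.
        path.write_text(json.dumps(metadata, indent=4), encoding="utf-8")
    return changed
```

The function only touches the first matching entry and leaves the file alone if nothing matches, so a return value of False is a hint that the assumed layout or key name does not fit.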
How do I change a comment?
A comment can be changed within the GUI by going back to the specific survey and opening Report weird things again. The old comment should be shown there. Deleting it and clicking Done should save the empty comment.
Alternatively, you can open the specific survey's data file (surveyData.json) and delete the comment there.
- Status currently only in overview metadata of experiment. Could be copied into surveyData.json if wanted. Not suitable to only keep it there though, as sorting all surveyData.json files is much slower than just sorting the single metadata.json.
- Currently the GUI only handles a single experiment. There could be another main window that opens into the current main window and shows either "Open" or "Done" status, depending on what is left (e.g., if nr_open_survey >1 ? Open : Done). How to handle "Needs checking" then?
- Currently the user/data curator has to add the experiment type. This could be detected automatically by reading the first/second/... question (potentially with OCR?).