Researchers at Carnegie Mellon University, in collaboration with the Pittsburgh Supercomputing Center (PSC), have developed a fine-tuning method called Self-Contrastive Fine-Tuning (SCoFT) to enhance AI image generation. The method targets cultural misrepresentation and offensiveness in AI-generated images, particularly for cultures underrepresented in internet training data.
The Problem with AI Image Generators
AI image generators, like Stable Diffusion, often produce results that can be misleading, inappropriate, or culturally insensitive. This issue is particularly acute for underrepresented cultures, as the data these models are trained on predominantly comes from Western sources. For example, when tasked with generating an image of a street in Nigeria, the output may evoke negative stereotypes rather than a realistic depiction.
The SCoFT Approach
To combat this, the CMU team created the Cross-Cultural Understanding Benchmark (CCUB) dataset, a collection of curated image-caption pairs from five cultures. Although small, the dataset serves as a foundation for fine-tuning image-generation models so that they represent those cultures more accurately and sensitively.
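A curated dataset of this kind is essentially a set of images paired with captions written by people from the culture in question. The sketch below shows one plausible way to wrap such data for fine-tuning; the file layout, field names, and class name are illustrative assumptions, not the team's actual code.

```python
# Minimal sketch of a small, curated image-caption dataset for fine-tuning.
# The metadata file name and record fields ("image", "caption") are assumptions.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class CuratedCulturalDataset(Dataset):
    """Pairs each curated image with a caption written by a cultural insider."""

    def __init__(self, root: str, transform=None):
        self.root = Path(root)
        # Each record looks like: {"image": "markets/street_market.jpg", "caption": "..."}
        self.records = json.loads((self.root / "metadata.json").read_text())
        self.transform = transform

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        record = self.records[idx]
        image = Image.open(self.root / record["image"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, record["caption"]
```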
SCoFT combines conventional fine-tuning on CCUB with a Self-Contrastive Perceptual Loss. During training, the loss compares each generated image against both the curated cultural examples and the output of the original, non-fine-tuned model, nudging the model toward the former and away from the latter.
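Conceptually, the contrast amounts to a perceptual pull toward the curated reference image and a push away from the baseline model's output. The sketch below illustrates that idea in PyTorch, using a pretrained VGG-16 as the perceptual feature extractor; the feature choice, hinge formulation, and variable names are assumptions for illustration, not the published implementation.

```python
# Conceptual sketch of a self-contrastive perceptual loss (not the authors' code).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor used to compare images in a perceptual space.
# Inputs are assumed to be (N, 3, H, W) tensors normalized with ImageNet statistics.
features = vgg16(weights=VGG16_Weights.DEFAULT).features.eval()
for p in features.parameters():
    p.requires_grad_(False)


def perceptual_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Mean squared distance between VGG feature maps of two image batches."""
    return F.mse_loss(features(a), features(b))


def self_contrastive_loss(generated: torch.Tensor,
                          cultural_reference: torch.Tensor,
                          baseline_output: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Pull the fine-tuned model's output toward the curated cultural image
    and push it away from what the original, non-fine-tuned model produced."""
    pull = perceptual_distance(generated, cultural_reference)
    push = perceptual_distance(generated, baseline_output)
    # Hinge-style term: stop pushing once the baseline output is `margin` farther away.
    return pull + F.relu(margin - push)
```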
The Role of PSC's Bridges-2 Supercomputer
PSC's Bridges-2 system was central to the project. The supercomputer's GPUs supported parallel processing and efficient data handling, letting the team run many experiments and iterate on their models; that capacity was essential for the large-scale training and repeated fine-tuning that SCoFT requires.
Results and Future Directions
Initial findings indicate that SCoFT substantially improves how people from the represented cultures rate the generated images: in their evaluations, SCoFT outputs matched the text prompts more closely, represented cultural context more faithfully, and were judged less offensive than the baseline model's outputs.
The team plans to present their findings at the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in June. Future applications of SCoFT may extend beyond national cultures to other communities, such as people with disabilities, broadening the inclusivity of AI-generated content.