This article proposes Dyamond, a one transistor, one capacitor (1T1C) dynamic random access memory (DRAM) in-memory computing (IMC) accelerator with architecture-to-circuit-level optimizations for high memory density and energy efficiency. The bit column addition (BCA) dataflow introduces output bit-wise accumulation to exploit varying accuracy and energy characteristics across different bit positions. The lower BCA (LBCA) reduces analog-to-digital converter (ADC) operations to enhance energy efficiency with inter-column analog accumulation. The higher BCA (HBCA) improves accuracy through signal enhancement and minimizes energy consumption per ADC readout with signal shift (SS). The design maximizes memory density by dedicating 1T1C cells solely to memory and integrating a compact computation circuit adjacent to the bitline sense amplifier. The memory access power is further reduced with a big-little array structure and a switchable sense amplifier (SWSA), which trades off retention time and energy consumption. Fabricated in 28-nm CMOS, Dyamond integrates 3.54-MB DRAM in a 6.48-mm2 area, achieving 27.2 TOPS/W peak efficiency and outstanding performance in advanced models such as BERT and GPT-2.