[{"content":"My father wrote a poem and called it 참회(懺悔) — \u0026ldquo;Repentance.\u0026rdquo; It is long, and it does not read in a single breath. The poem is in Korean. I copy it here in its original form, and then I write down a short reading after it.\n참회(懺悔)\n술을 먹으면 알딸딸하게 좋다가도\n새벽에 악몽으로 깨고 나면\n왜 이리 심장이 끊어질 듯 아려오는가?\n저 깊고도 아득한 기억의 저편\n음습한 혼돈의 바다 속\n잠자던 악령이 깨어나\n사지를 쇠사슬로 묶고\n펄떡이는 심장을 도려낸다.\n도대체 왜?\n한편의 일그러진 분노와\n또 한편의 형언할 수 없는 불안\n뜯겨나간 허탈한 자리에\n불현듯 부끄러움이 밀려든다.\n중학교 때였던가?\n참 마음이 고왔던 아이\n소아마비로 다리를 절었던 아이\n그 누구보다도 당당했던 아이\n서로 마주보다가 묘한 감정에\n서로를 밀어냈던 애틋한 아이\n난 쬐끔, 넌 악성 소아마비\n병신새끼들! 육갑떨고 자빠졌네.\n짝패들의 시선과 조롱에 움츠러들며\n너를 홀로 화살받이로 세워놓고는\n숨더니 외면하고 도망치는 나.\n백옥의 얼굴이 촛농처럼 흘러내리며\n점점 악다구니로 변해가는 너.\n\u0026lsquo;미안해\u0026rsquo; 말도 못하고 피해 다니는\n\u0026ldquo;비겁한 새끼\u0026rdquo;\n나의 또 다른 이름\n고등학교 ?학년 때였지.\n\u0026ldquo;잘난 체 하지 마!\u0026rdquo;\n짝궁이 하는 말에\n대뜸 인정사정없이\n그의 여린 뺨에 주먹을 날린다.\n뭐 대단치도 않은 말에\n의기양양 거들먹거리며\n왜 그리 거만하게 굴었는지?\n\u0026ldquo;못난 새끼\u0026rdquo;\n또 다른 나의 악령\n그 모습을 지켜보다 못한\n절친의 \u0026ldquo;너무 하네!\u0026rdquo;\n한마디에 절교를 한\n\u0026ldquo;옹졸한 새끼\u0026rdquo;\n또 하나의 악업이 추가된다.\n언제였던가?\n육중한 체구의 그들은\n칠성판에 뉘어 놓고 사지를 비틀며\n나의 부끄러운 심장을 해부한다.\n나는 수백 번의 진술서를 써내려가며\n\u0026ldquo;비겁한 새끼\u0026rdquo;,\n\u0026ldquo;못난 새끼\u0026rdquo;,\n\u0026ldquo;졸렬한 새끼\u0026quot;임을 자인한다.\n신념과 의지는 한낱 공염불이었음을\n수십 번의 반성문으로 또 다시 입증한다.\n나의 육신과 영혼은 너덜너덜 찢긴 채\n쇠사슬에서 풀려나 허공 속으로 흩어진다.\n이따금 미몽을 헤맬 때마다\n뫼비우스 띠처럼\n한편의 광기와\n반대급부의 수치심이\n당혹스럽게 외길에서 만나\n과거와 현재가 얼굴을 마주한다.\n그들이 나의 심장을 도륙하듯\n나 또한 너의 심장을 도려냈구나!\n그들이 나의 영혼을 난도질하듯\n나 또한 너의 영혼을 산산이 부숴났구나!\n그렇게 잊혀진 세상은!\n현란한 시니피앙의 깃발에 난무하는\n시니피에의 지루박 댄스 장\n시니컬한 포스트모더니즘 장단에\n초조한 욕망들만이 아우성친다.\n기껏 살아낸 삶의 족적들, 나의 새끼들!\n더러운 새끼, 치사한 새끼, 야비한 새끼\n오만한 새끼, 응큼한 새끼, 추잡한 새끼\n편협한 새끼, 냉담한 새끼, 비열한 새끼\n교활한 새끼, 악랄한 새끼, 역겨운 새끼…\n무수한 새끼들이 옷깃을 여미며\n버젓이 음흉한 거리를 활보하는데…\n인생은 꼬일 대로 꼬여버린 새끼줄\n새빨간 위선에 포박당한 오랏줄.\n황혼 녘 시뻘건 구름이 취한 듯\n상처받은 영혼들이 비틀거린다.\n심장과 영혼들이 녹아내리며\n붉게 물든 얼굴이 일그러진다.\n질척대는 욕망의 잔재를 태우고\n내안의 순수를 깨울 수 있다면\n영혼을 정화하고 위로하기위해\n너와 나, 우리의 희망을 위해\n새로이 피어나는 여명을 위해…\n슬픈 대지를 품에 안은 붉은 노을\n스러질 듯 몸부림치는 불길 속을\n부나방이 되어 
걷고 또 걸어간다.\n몽환적인 핑크보랏빛 그라데이션에\n취해 흑갈색 침묵 속으로 사라지더라도…\nI sat still for a long time after I finished the poem. The first time through, the words were too strong; I could not keep pace with them. On the second reading I stopped at this line: 그들이 나의 심장을 도륙하듯 / 나 또한 너의 심장을 도려냈구나 — they tore my heart out, and I in turn cut out yours. At the place where my father lies bound on the wooden plank, his body twisted, the face of the polio-stricken friend he had once run from rises again in the poem. The recognition that received pain reflects given pain runs through the whole poem.\nReading through the \u0026ldquo;새끼들\u0026rdquo; my father is ashamed of — 비겁한 새끼, 옹졸한 새끼, 졸렬한 새끼 — line by line, a different thought came to me. The names he could not bring himself to say for so long, in the poem he was finally calling out one by one. That he had carried this shame inside for so long and at last pulled it out into a poem — that itself felt like a relief. Between those words I could see a person who, in the end, lifts shame out into writing. The selves he fears do not stand with such terrifying faces to me. Reading the poem, I felt as if I were meeting his shame for the first time.\nI stayed with the last stanza for a long time. A moth walks into the red dusk, into the burning sky, again and again. The poem says it disappears into the silence, but to me that part did not read like a disappearance. The poem had ended, yet the moth\u0026rsquo;s footsteps were still going somewhere — as though my father had not finished his repentance in a single poem. I read it to the end, and then I read it once more.\n","permalink":"https://wid-blog.github.io/en/posts/daily/notes/fathers-repentance/","summary":"My father wrote a poem in Korean and called it 참회(懺悔), \u0026lsquo;Repentance.\u0026rsquo; The full poem, followed by my short response.","title":"Reading My Father's Poem '참회'"},{"content":"I am building an automated trading system on my own. 
Rust + Python + React, with quotes, backtesting, automated orders, and signal trading all in place.\nLooking through my PR history, the work has been BackgroundTask supervisor, pre-condition guard, DeploymentStatus::Abandoned transition, stuck monitor, LocalSim prices — one safety lock after another for a system I haven\u0026rsquo;t even turned on. At first this felt off. Why build the off switch instead of more features?\nAutomated trading is just like that. Mistakes move real money, no one is around to wake me up at night, and there is no real-data verification yet. With those three stacked, \u0026ldquo;halt safely\u0026rdquo; came before \u0026ldquo;run well,\u0026rdquo; and I started building from there.\nETF rebalancing and automated portfolio management What I wanted to build was an asset management system for securities. A system that defines a portfolio with a mathematical formula and adjusts weights to match, automatically, without any human in the loop.\nThe starting point was ETF-based rebalancing. ETFs come with diversification built into a single symbol, so the decision cost of picking stocks stays low, and the trading frequency is low enough that costs stay manageable. That made them a fit for the first stage of automation at the allocation level. The flow is: validate an allocation strategy with backtesting, deploy the validated strategy to an account, and let it adjust weights on a schedule.\nNext came the single-stock level. Factor screening narrows down the candidates, backtesting validates the result, and signal trading automates entry and exit. The decision unit is finer than allocation. The underlying flow is the same — define something mathematically and let it run on its own.\nRust + Python split and broker abstraction The initial code skeleton had two pieces. The Rust + Python split and the broker abstraction. 
Neither was made for safety.\nThe Rust + Python split was a workload separation.\nThe Rust server handles the real-order flow — order placement, balance sync, signal evaluation. I picked Rust because the operating environment (Oracle Cloud ARM Always Free) is resource-constrained, so a small memory footprint helped, and the latency of signal trading felt safer without GC pauses. Wanting to learn Rust was part of the motivation too.\nStateless computation like backtesting and factor screening lives in the Python quant engine, with no DB dependency. vectorbt covers backtesting well, so there was no reason to reimplement it in Rust.\nThe broker abstraction puts KIS and LocalSim behind the same trait. The order executor takes a trait, and the real KIS implementation and the LocalSim implementation satisfy the same interface. LocalSim started out as a way to develop in environments without KIS access.\nLooking back before launch, the two of them had become the first layer of safety. The Rust + Python split divided the blast radii of two processes, and the broker abstraction made every automated flow runnable as a dry-run on LocalSim.\nHalt, block, detect, and simulate are the safety layers I deliberately added to the automated trading flow.\nBackgroundTask supervisor and abandoned transition Halt was the first safety layer I added.\nThe first piece was the BackgroundTask supervisor. Long-running background tasks — the signal engine, risk monitor, US market poller — can die from a panic, and I thought the dangerous state was \u0026ldquo;dead, and nobody knows.\u0026rdquo; The supervisor detects the panic and restarts or permanently halts based on the policy. Defining restart policy as a trait + enum let me plug a different policy into each task, which felt clean.\nNext, I added the DeploymentStatus::Abandoned transition. Originally, if liquidation didn\u0026rsquo;t finish, the operator had to stop it manually, but leaving something stuck for days felt risky. 
I made the deployment transition to Abandoned when liquidation shows no progress for a defined period. In Abandoned, automated trading halts and no further orders are issued.\nThe force-abandon API came in the same wave. As a backup for when the automatic transition doesn\u0026rsquo;t fire, I added a way to stop everything from the outside with one call. The goal was to make sure no state could ever become \u0026ldquo;impossible to stop.\u0026rdquo;\nThe last piece was the scheduler\u0026rsquo;s automatic abandoned transition. At a fixed time each day, the scheduler checks liquidation states and marks abandoned anything that meets the criteria. Even without daily manual checks, abnormal states don\u0026rsquo;t pile up.\npre-condition guard and chat_id gate Block came next. If halt is a lock during operation, block is a lock at the entrance.\nThe first was the deployment pre-condition guard. With no broker credentials registered, both deployment activation and liquidation are refused. Originally, activation fetched credentials mid-flow and failed when they were missing, but the failure came too late. I moved the check just before activation, so a missing credential gets rejected at that point.\nThe Telegram chat_id onboarding gate was a similar decision. Once automated trading starts, fill notifications, signals, and reconciliation alerts need to arrive immediately, but if a deployment activated without an alert channel set up, there was no way to know when an issue arose. So I added a gate that throws a strong warning when a user without a registered chat_id tries to create a deployment.\nBoth the pre-condition guard and the chat_id gate started at activation time and gradually moved earlier. Blocking late lets bad state seep deeper in, and the cost of rolling it back gets bigger.\nstuck monitor and balance reconciliation For halt and block to work, the abnormal state has to be visible first. 
Detection was the next batch of work.\nThe LiquidationStuckMonitorTask came first. It periodically checks the progress of liquidations in flight, and if nothing changes for a defined period, it fires a Sentry alert. It pairs with the automatic abandoned transition — the signal that \u0026ldquo;it\u0026rsquo;s time to auto-halt\u0026rdquo; becomes visible to a human.\nBalance reconciliation was a different shape. The real KIS balance is compared against the strategy_position ledger I maintain, and a Telegram notification fires on mismatch. It\u0026rsquo;s a check against the broker\u0026rsquo;s view of truth to see whether automated trading reflects exactly what was intended. When drift shows up, the reconcile API corrects it.\nLast, I cleaned up the alert channels. I split Sentry and Telegram into separate Trading and System channels — real-order flow events go to Trading, and operational/system anomalies go to System. With solo operation, alerts buried in noise make detection itself meaningless, and that was the motivation.\nLocalSim and dev seed Simulation was last. I wanted to be able to run the whole circuit before any real capital moved.\nThe LocalSim price stream was the starting point. Originally I faked prices with a monotonic function. That was enough when I only needed to confirm the circuit ran, but when I wanted to verify signal evaluation and rebalancing against varied patterns, it wasn\u0026rsquo;t enough. So I swapped the model to GBM. Giving each symbol a different drift and volatility produces a stream where rises, falls, and flat periods mix.\nThe dev trading_credentials seed was another tool I built for simulation. Running the real-order flow locally needs credentials, and injecting them manually each time was tedious. I wrapped an idempotent UPSERT seed command into a Makefile target. 
Since reproducing the dev environment became one line, I started running checks more often.\nWithout a simulator, the first real order becomes the first integration test. In automated trading, I think that\u0026rsquo;s the most expensive kind of test. Only after seeing a clean run on LocalSim for days do I think I\u0026rsquo;ll have the confidence to move to real orders.\nRetrospective Building the off switch for a system that hasn\u0026rsquo;t been turned on felt odd at first, but after going through it, I think it was the most reasonable order of work for the combination of automated trading + solo + pre-launch. Building safety locks before more features cuts down on the shakiness at the moment of going live.\nThere\u0026rsquo;s still a way to go before turning it on. I want to see the full automated flow run cleanly on LocalSim for a meaningful period. Moving to real orders without confirming that signal generation, order placement, fill, balance sync, and liquidation all cycle through cleanly is risky. Each safety layer also has to be broken at the circuit level once — does the supervisor actually catch panics, does the guard actually refuse bad activations, does the stuck monitor actually fire its alert?\nThe first real order is itself a safety boundary. I plan to verify with a small position for a period, then scale up in stages. 
Going to full size in one step would go against everything I built.\nThe next retrospective will be written after going live.\nReferences Backtest Performance Metrics — Metrics used to validate both allocation and single-stock strategies in backtesting Portfolio Management and Factor Scoring — Background for the single-stock-level factor screening Technical Indicators and Trading Signals — Technical indicators used in signal trading Korean Account Types and Investment Constraints — Korean market constraints behind the KIS integration ","permalink":"https://wid-blog.github.io/en/posts/career/personal/quant-investment-platform-mid-retrospective/","summary":"A mid-project retrospective on a personal automated trading platform built with Rust + Python + React. With ETF rebalancing and single-stock signal trading both in place, a record of how the safety layers — halt, block, detect, simulate — got built before going live.","title":"quant-investment-platform — mid-retrospective"},{"content":"An earlier post noted overfitting as one of five backtest pitfalls but deferred the question of how to verify it quantitatively. The companion post on efficient frontier optimization also leaves an open thread — input estimates for μ and Σ dominate the result, so out-of-sample stability needs to be confirmed.\nWalk-forward analysis slides train-test windows through time and exposes the difference between in-sample (IS) and out-of-sample (OOS) performance, along with parameter stability.\nThe Trap of a Single Backtest When parameters are searched on the same data, IS performance almost always improves. Try enough signal candidates, lookbacks, and thresholds, and some combination will fit well by chance. This is data snooping. Putting that result into live trading often leads to collapse on OOS data.\nThe natural fix is a train/test split — slice the data once chronologically, fit parameters on the train portion, measure on the test portion. 
The trouble is that time-series data cannot be shuffled randomly, since future information must not leak (look-ahead bias). A single chronological cut leaves the result entirely dependent on whether that one cut point happened to fall in a favorable or unfavorable regime. The OOS result from a single split has high variance and is timing-dependent.\nThe Structure of Walk-Forward\nWalk-forward produces many train-test pairs. As windows slide through time, each fold uses train to set parameters and test to measure performance.\n|---- train ----|-test-|\n       |---- train ----|-test-|\n              |---- train ----|-test-|\n                     |---- train ----|-test-|\nThe same dataset yields multiple OOS samples, and the distribution across folds tells you how stable the strategy is.\nAnchored fixes the start of the train window and extends only the end. The train window grows over time, using more past information as folds progress. The implicit assumption is that all past information remains valid in the future.\nRolling keeps a fixed train window size and slides it forward. Only the most recent N years are ever used, which lets the model adapt to regime change. The assumption is that older information turns into noise.\nWhich one fits depends on the asset class and market structure. Equity indices often use anchored; assets sensitive to macro regimes or derivative strategies often use rolling.\nWhat to Measure\nAggregate the OOS results across folds and look at the distribution.\nOOS performance per fold: CAGR, Sharpe Ratio, MDD per fold\nPerformance degradation: gap between IS Sharpe and OOS Sharpe. IS 1.5 paired with OOS 0.3 means a 1.2 gap and severe overfitting. IS 1.0 paired with OOS 0.8 is closer to a stable strategy.\nParameter stability: whether the optimal parameters chosen per fold are similar. If a different lookback wins every fold, the signal itself is unstable.\nOOS distribution: not just the mean but the per-fold variance and the worst-fold loss. 
A solid average can still hide a -40% worst fold that would be unmanageable in practice.\nThe point is the distribution, not a single number. A strategy whose OOS Sharpe clusters tightly around 0.5 across folds is often more trustworthy than one that averages 1.0 but swings wildly fold to fold.\nCase: Tuning Momentum Lookback\nSuppose the candidate lookbacks for a momentum strategy are 3, 6, and 12 months. A single backtest over the full period might show that 6 months gives the highest average Sharpe Ratio. The conclusion becomes \u0026ldquo;6 months wins\u0026rdquo;.\nWalk-forward on the same data tells a different story. With 10 folds, the question shifts to how the winning lookback is distributed across folds.\nIf 6 months wins 7 out of 10 folds, the signal is robust\nIf the distribution is 4-4-2 across 3, 6, and 12 months, there is no robust signal in the lookback choice\nIf 12 months wins the first 5 folds and 3 months wins the last 5, that pattern indicates a regime change\nThe same data yields one decision under a single backtest (\u0026ldquo;6 months wins\u0026rdquo;), but under walk-forward it yields a distribution of decisions and a measure of their stability.\nLimits\nWalk-forward is not a universal validator either.\nCompute cost: the number of runs is folds × parameter-grid size, which grows quickly. 10 folds with 10 lookback candidates means 100 backtests.\nShort time series: fewer than 10 years of data leaves too few folds. Five years split into 5 folds gives short train windows, and parameter estimates wobble.\nSurvivorship bias persists: walk-forward only handles the time split. If delisted names are missing from the data, every fold inherits the same bias.\nRegime change is hard to distinguish: if train and test regimes differ, weak OOS performance might be overfitting or might be a regime shift, and the two are hard to separate.\nTrue OOS is still the future: OOS within a fixed dataset is ultimately a retrospective split. Truly new OOS data only comes from live operation. 
A more sophisticated approach is Combinatorial Purged Cross-Validation, proposed by López de Prado. Instead of chronological folds, it forms many train-test combinations while purging boundary leakage. Statistical power is higher, but so are implementation complexity and compute cost.\nWalk-forward analysis quantifies the reliability of a backtest. Only when the IS-OOS difference is small and the chosen parameters are stable across folds can a strategy be called \u0026ldquo;not just luck\u0026rdquo;.\nThe same kind of check applies to the efficient frontier optimization covered in the companion post. The Markowitz model is sensitive to estimates of μ and Σ, so re-estimating those inputs at each fold and observing the resulting weights and OOS performance is a natural extension. If the inputs swing across folds, the weights and OOS results will swing too.\nReferences Investopedia — Walk-Forward Optimization Investopedia — Overfitting vectorbt — Portfolio Optimization Marcos López de Prado, Advances in Financial Machine Learning (2018), Ch.7 \u0026ldquo;Cross-Validation in Finance\u0026rdquo; Bailey, D., López de Prado, M. (2014). \u0026ldquo;The Probability of Backtest Overfitting\u0026rdquo; ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/walk-forward-validation/","summary":"The structure of walk-forward analysis, the metrics it produces (IS-OOS gap, parameter stability), a momentum-lookback tuning case, and the limits that keep it from being a universal validator.","title":"Walk-Forward Analysis and Overfitting Validation"},{"content":"The earlier posts dealt with the math of the strategies themselves. Backtest metrics, the efficient frontier, walk-forward — all answering \u0026ldquo;what weights, when, on which names\u0026rdquo;. For a Korean retail investor moving a strategy to live operation, the first decision sits one layer above. 
Which account holds it?\nBuying the same ETF in a general account versus an ISA leads to different after-tax outcomes. Some strategies cannot run in certain accounts at all. Account constraints are an input to strategy design, not an afterthought.\nGeneral Account The most flexible account. Domestic stocks, foreign stocks, ETFs, funds, and bonds can all be traded.\nThe trade-off for flexibility is that there is no tax-free quota. Financial income (interest and dividends) above KRW 20 million per year falls into comprehensive income taxation. Capital gains on Korea-listed stocks are tax-free, but foreign stocks are taxed at 22% capital gains tax after a KRW 2.5 million annual deduction.\nStrategies that require direct foreign investment — SPY, QQQ, VGK and the like — can only run here.\nISA The Individual Savings Account allows only Korea-listed products. Foreign ETFs and foreign stocks are not allowed. Korea-listed ETFs that track foreign indices, like TIGER S\u0026amp;P 500, are still tradable.\nThe tax benefit defines the account. The standard form is tax-free on up to KRW 2 million of net gains (KRW 4 million for the \u0026ldquo;low-income\u0026rdquo; variant with annual salary below KRW 50 million), with anything above that taxed at a flat 9.9% on withdrawal. Annual contribution is capped at KRW 20 million, up to KRW 100 million cumulative, with unused capacity rolled forward.\nThe mandatory holding period is three years. Rolling the maturity proceeds into a pension account unlocks additional tax breaks. The account fits domestic ETF allocations and Korean single-stock factor strategies.\nPension Savings Account Only Korea-listed ETFs and funds can be traded. Individual stocks are not allowed.\nThe motivation is the income tax credit. Contributions up to KRW 6 million per year qualify for a credit of 16.5% (for annual salary at or below KRW 55 million) or 13.2% (above). 
A KRW 6 million contribution returns up to KRW 990,000 in taxes.\nWithdrawals are restricted to age 55 and above, with a 3.3% to 5.5% pension income tax at distribution. Early withdrawal triggers a 16.5% miscellaneous income tax, which effectively erases the tax benefit. The structure enforces long-term holding.\nIt suits conservative allocations (all-weather, 60/40, three-bucket) held over a long horizon.\nIRP\nThe Individual Retirement Pension, like pension savings, allows only Korea-listed ETFs and funds. The decisive difference is the 70% risk-asset cap.\nEquity-type assets cannot exceed 70% of the account balance. The remaining 30% must be filled with safe assets: bond funds, MMFs, or deposits. A 100% equity strategy cannot run inside IRP.\nThe tax credit shares a combined annual cap of KRW 9 million with the pension savings account. A common split is KRW 6 million in pension savings plus KRW 3 million in IRP. Withdrawal and early-exit rules match the pension savings account.\nThe 70% rule looks like a constraint but also nudges naturally toward diversified allocations. A 60/40 portfolio fits IRP without modification, and running closer to 70/30 against the cap is a frequent setup.\nStrategy-to-Account Mapping\nMapping strategy types against account feasibility produces the following table.\nStrategy | General | ISA | Pension | IRP\nDirect foreign (SPY, QQQ) | ✅ | ❌ | ❌ | ❌\nKorea-listed foreign-index ETF | ✅ | ✅ | ✅ | ✅\nKorean single stocks | ✅ | ✅ | ❌ | ❌\nDomestic ETF allocation | ✅ | ✅ | ✅ | ✅\n60/40 allocation | ✅ | ✅ | ✅ | ✅\n100% equity allocation | ✅ | ✅ | ✅ | ❌\nThe trade-off between tax benefits and trading freedom is explicit. 
The general account has the most freedom and the fewest tax benefits; ISA, pension, and IRP carry strong tax benefits but limit which products and holding periods are allowed.\nIn practice, running different strategy buckets in different accounts is a common pattern.\nGeneral: direct foreign positions (SPY, QQQ), short-term trading\nISA: Korean single-stock factor strategies, using the KRW 2 million tax-free quota\nPension savings: conservative allocations through Korea-listed foreign-index ETFs\nIRP: 60/40 allocations that align naturally with the 70% rule\nAfter-Tax Impact\nSuppose KRW 100 million is invested in KODEX 200 and held for five years at an annual return of 5%. The final value is roughly KRW 127.6 million, with about KRW 27.6 million in gains.\nIn a general account, capital gains on Korea-listed domestic-equity ETFs such as KODEX 200 are tax-free, while distributions incur a 15.4% dividend withholding tax. In an ISA, the same gains plus distributions net against each other, the first KRW 2 million is tax-free, and anything above is taxed at a flat 9.9%, which is clearly better than the general account on an after-tax basis. In pension and IRP accounts, taxation is deferred during accumulation and applied as pension income tax on withdrawal, sidestepping the progressive comprehensive-income rate.\nExact figures depend on distribution yield, other financial income, and enrollment timing. The numbers above only indicate direction. Precise tax calculations sit in expert territory.\nCaveats\nTax law changes every year. This post reflects the rules as of May 2026; credit rates, contribution caps, and tax-free thresholds shift regularly through legislative revisions. Re-checking the latest rules just before any decision is the safer move.\nEligibility also depends on individual circumstances. Annual salary bracket, prior account history, employer-sponsored IRP enrollment, and total financial income all act as variables. 
This post stays at the level of orientation rather than tax advice.\nAccount type is an input to strategy design. Deciding which strategy goes into which account, then descending into security selection and weights, is the natural order. The same allocation strategy must be modified for IRP under the 70% rule; the same ETF delivers the most after-tax value in an ISA within the tax-free quota.\nThe next post returns to backtesting, expanding the one-line entries on look-ahead bias and survivorship bias into concrete case studies.\nReferences Financial Supervisory Service — Integrated Financial Product Comparison National Tax Service — Pension Account Tax Benefits Investopedia — Tax-Advantaged Accounts Korean brokerage account guides (Korea Investment \u0026amp; Securities, KB Securities, etc.) for the latest list of products tradable in each account ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/korean-account-types/","summary":"The core constraints of the four Korean retail account types — general, ISA, pension savings, and IRP — covering tax-free thresholds, tax deductions, the 70% risk-asset cap, and access to direct foreign investment, plus a strategy-to-account mapping.","title":"Korean Account Types and Investment Constraints"},{"content":"An earlier post in this series listed five backtest pitfalls as one-liners in a table. Two of them — look-ahead bias and survivorship bias — sit apart from the others. They warp results in a single direction, namely the direction that inflates CAGR.\nThese are not two-sided noise but systematic bias, so missing them means carrying the inflation straight into live trading. Walk-forward analysis, covered in the companion post, does not catch them either. Walk-forward only partitions time, so future information leaked into the data, or the survivors-only universe, flows untouched into every fold. 
The patterns only become clear at the case level.\nLook-ahead Bias This is any situation where a decision at time t uses information from time t+k. Accidental future leakage shows up in subtle forms in backtests.\nCase 1. Full-Period Momentum Normalization Consider a strategy that ranks securities by z-scored momentum and picks the top N. If the z-score at time t uses the mean and standard deviation of the full period, those statistics carry information from t+k. The score at t silently reflects data that was not yet available.\nThe fix is straightforward. Compute the mean and standard deviation on a rolling window that ends at t. Nothing from after t enters the normalization.\nCase 2. Ignoring Financial Disclosure Lag A fiscal year ending 2024-12-31 is typically disclosed in March 2025. Using that statement in a backtest decision dated January 2025 means consuming information that had not yet been published.\nThe fix is to model the disclosure lag explicitly. Sources like DART or SimFin expose both the period end and the actual filing date as separate columns. The backtest only uses the statement after the filing date.\nCase 3. Close-Signal, Close-Fill A common shortcut is to compute a signal on the closing price (moving-average cross, RSI) and assume execution at the same close. In reality, by the time the close has printed, orders can no longer be placed at that price. The realistic fill is the next-day open or VWAP.\nClose-signal-close-fill misses the next-day gap effect entirely, effectively assuming an entry price that is better than what would have happened. The fix is to fill at the next session\u0026rsquo;s open or VWAP, or keep the close fill and add slippage.\nSurvivorship Bias This happens when a backtest runs only on names that survived to today. Delisted names are usually the losers, and removing them inflates CAGR.\nCase 1. S\u0026amp;P 500 Reconstitution Over Time The S\u0026amp;P 500 is not a fixed roster. 
Somewhere between ten and twenty-odd names are replaced each year. Over a decade, well over a hundred names have come and gone.\n\u0026ldquo;Backtesting on S\u0026amp;P 500 names over 10 years\u0026rdquo; depends on which year\u0026rsquo;s roster is used. Running today\u0026rsquo;s 500 against 10-year-old prices already filters to survivors. The backtest should instead use the roster as it stood 10 years ago, including names that have since been delisted, acquired, or otherwise removed.\nThe fix is to use a point-in-time index composition source such as Compustat or CRSP. Free APIs do not carry that information, so where commercial data is out of reach, the bias has to be acknowledged and results read conservatively.\nCase 2. Limits of Yahoo Finance and Free APIs\nYahoo Finance, Korea\u0026rsquo;s KIS API, Naver quotes, and similar free sources rarely keep delisted names. Once a ticker is delisted, the symbol disappears or price queries return nothing.\nThe Korean market behaves the same way. Delisted KOSPI names lose their historical price data in free sources. A \u0026ldquo;10-year KOSPI factor backtest\u0026rdquo; on free data therefore carries built-in survivorship bias.\nThere are two ways out. Buying commercial data is the direct fix; where the budget does not allow it, results need to be read with a correction margin. Academic estimates of CAGR inflation from survivorship bias often cite 2–4% per year.\nAdjacent Pitfalls\nTwo related biases tend to sit next to these.\nData snooping is finding a \u0026ldquo;good\u0026rdquo; combination by trying enough parameter combinations. Ten signal candidates × ten lookbacks × ten thresholds means 1,000 runs, and a few will look strong by pure chance. Walk-forward analysis, covered in the companion post, partly mitigates this but does not fully solve it.\nSelection bias is reporting only the periods where the strategy looks good. Showing 2010–2020 and omitting 2008 changes the impression of the same strategy. 
The statistical strength of the claim drops.\nAvoidance Checklist Run through this list before running a new backtest or when reading existing results.\nDoes the signal calculation avoid using information from after t? Are financial statements lagged to their actual filing dates? Does execution happen at a time different from the signal time (a time when orders can actually be placed)? Is the universe point-in-time, or is today\u0026rsquo;s roster being applied to past data? Are transaction costs and slippage included in the simulation? Some items cannot be resolved fully under free-data constraints. In those cases, marking the limitation explicitly and adding a margin to result interpretation is the next-best option.\nOnce these biases are named, they can be avoided. With free-data setups, survivorship bias in particular is hard to remove completely, and conservative reading of results is the rational stance. Combined with walk-forward analysis from the companion post, the reliability of a backtest gains a second layer. Walk-forward handles the time split; the checklist above handles data integrity.\nReferences Investopedia — Look-Ahead Bias Investopedia — Survivorship Bias Investopedia — Data Snooping Bias Marcos López de Prado, Advances in Financial Machine Learning (2018) Bailey, D., López de Prado, M. (2014). \u0026ldquo;The Probability of Backtest Overfitting\u0026rdquo; ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/backtest-pitfalls-case-study/","summary":"Concrete cases of look-ahead bias and survivorship bias — full-period momentum normalization, financial disclosure lag, close-on-close fills, S\u0026amp;P 500 reconstitution, the limits of free APIs — followed by an avoidance checklist.","title":"Backtest Pitfalls Case Study"},{"content":"An earlier post listed traditional asset-allocation weights like \u0026ldquo;60% equities, 40% bonds\u0026rdquo; in a table but deferred the question of where those numbers come from. 
The mean-variance model proposed by Markowitz in 1952 reduces the problem of choosing weights across N assets to a mathematical optimization. It finds the weight vector that satisfies \u0026ldquo;maximum return for a given risk\u0026rdquo; or \u0026ldquo;minimum risk for a given return\u0026rdquo;.\nThe model itself is simple. The trouble starts when its inputs — expected returns and the covariance matrix — are estimated, since estimation error then dominates the outcome. This is why practitioners rarely use Markowitz as-is and reach for variants instead.\nThe Math of Weight Selection When there are N assets, the number of possible weight combinations is infinite. With only the constraint that weights sum to one (Σwᵢ = 1), N-1 degrees of freedom remain. Answering \u0026ldquo;which combination is best?\u0026rdquo; requires a definition first.\nMarkowitz\u0026rsquo;s definition starts with two objects.\nExpected return vector μ ∈ ℝᴺ — the expected return of each asset Covariance matrix Σ ∈ ℝᴺˣᴺ — how returns move together across assets Both are typically estimated from historical returns as sample mean and sample covariance. The diagonal of the covariance matrix holds individual variances; the off-diagonal entries capture how pairs of assets co-move.\nGiven a weight vector w, the portfolio\u0026rsquo;s expected return and variance are:\nE[Rₚ] = wᵀμ σₚ² = wᵀΣw Expanding the two-asset case makes the intuition clearer.\nσₚ² = w₁²σ₁² + w₂²σ₂² + 2w₁w₂ρσ₁σ₂ The lower the correlation ρ, the smaller the last term, and the lower the variance. This is the mathematical foundation of diversification. With ρ = -1, two assets can be combined to drive variance to zero with the right weights. Such pairs are rare in practice, but even combining assets with low correlation meaningfully reduces risk.\nThe Efficient Frontier Plotting all possible weight combinations on the (risk σₚ, return E[Rₚ]) plane produces a filled region. 
The boundary that gathers only the points with the highest return for each level of risk is the efficient frontier.\nPoints on the frontier are weight vectors that \u0026ldquo;cannot do better\u0026rdquo;. Points inside the frontier are inefficient — another weight vector exists that delivers a higher return at the same risk. Analysis then naturally narrows to \u0026ldquo;we will only pick from the frontier\u0026rdquo;.\nTwo Optimal Solutions Picking a single point on the frontier needs yet another definition.\nMinimum-Variance Portfolio min wᵀΣw s.t. Σwᵢ = 1 The objective is to find the weights that minimize variance. The key feature is that the return estimate μ does not appear. The greatest weakness of the Markowitz model is the instability of μ estimation, and the minimum-variance solution sidesteps that risk. Only the covariance needs to be estimated.\nThe trade-off is that return information is discarded, so the result tends to be conservative. Weights naturally concentrate in low-volatility assets.\nTangency Portfolio max (wᵀμ - r_f) / √(wᵀΣw) The objective is the portfolio\u0026rsquo;s Sharpe Ratio. The solution lies at the point where a line drawn from the risk-free rate r_f is tangent to the efficient frontier. Since the Sharpe Ratio is maximized, the chosen weights give the best risk-adjusted return.\nIn an earlier post, Sharpe Ratio was the metric measuring risk-adjusted return for individual strategies. The tangency portfolio transplants that concept into weight selection across assets. The weakness is sensitivity to μ — a small shift in expected returns moves the tangent point sharply, and the weights with it.\nConstraints The theoretical model has only the sum constraint, but practical setups add more.\nwᵢ ≥ 0 — long-only. No short selling. The default assumption for retail accounts. wᵢ ≤ w_max — upper bound on individual asset weight. Limits concentration risk in a single name. Σ_{sector} wᵢ ≤ s_max — sector-level cap. 
Each added constraint narrows the feasible space. The efficient frontier itself shifts inward, settling at points below the \u0026ldquo;theoretical optimum\u0026rdquo;. Practitioners accept the efficiency loss in exchange for realizability and risk control.\nLibraries like cvxpy and pypfopt expose this class of optimization through a standard interface. Users only need to provide μ, Σ, and the constraints.\nMarkowitz Pitfalls and Practical Adjustments The model\u0026rsquo;s weaknesses lie in input estimation and distributional assumptions.\nExpected-return estimation error is the most damaging. A small deviation in μ sends the optimal weights to extremes. This is why the Markowitz optimizer is sometimes called an \u0026ldquo;estimation error maximizer\u0026rdquo;. The premise of using historical means as forward-looking expectations is fragile.\nCovariance matrix instability is another concern. As the number of assets grows, the sample covariance estimate itself becomes ill-conditioned. This limitation is one motivation behind methods like Hierarchical Risk Parity, discussed below.\nThe normality assumption is another limit. Measuring risk by variance alone cannot capture fat tails or skewness. Extreme events like the 2008 financial crisis lie outside the distribution the model assumes.\nIn practice, several variants are used instead of pure Markowitz.\nEqual-weight (1/N) — same weight on every asset. DeMiguel et al. (2009) reported that 1/N often beats Markowitz variants on out-of-sample performance, simply because there is no estimation error. Risk parity — weights are set so that each asset contributes equally to portfolio risk. No return estimate is required, avoiding the instability of μ. Hierarchical Risk Parity — assets are clustered by correlation and weights are assigned recursively. Proposed by López de Prado, it stays stable even when the sample covariance is ill-conditioned. 
Black-Litterman — combines market-equilibrium weights as a prior with investor views in a Bayesian framework. The uncertainty in μ becomes an explicit part of the model. The methods differ in which weakness of Markowitz they work around. Equal-weight removes estimation error entirely, risk parity and HRP avoid estimating μ, and Black-Litterman makes the uncertainty in μ an explicit part of the model.\nThe efficient frontier is a starting point for portfolio weight selection, not an endpoint. Since estimation error in the inputs dominates the outcome, a simple strategy like 1/N often beats Markowitz on out-of-sample data — a well-documented paradox.\nThis limitation leads directly to the next question: how do we verify that weights that look good in a backtest will hold up in live operation? The next post covers walk-forward analysis as a way to quantify the reliability of backtest results.\nReferences Investopedia — Modern Portfolio Theory (MPT) Investopedia — Efficient Frontier pypfopt — Efficient Frontier Markowitz, H. (1952). \u0026ldquo;Portfolio Selection\u0026rdquo;. Journal of Finance 7(1) DeMiguel, V., Garlappi, L., Uppal, R. (2009). \u0026ldquo;Optimal Versus Naive Diversification\u0026rdquo;. Review of Financial Studies 22(5) López de Prado, M. (2016). \u0026ldquo;Building Diversified Portfolios that Outperform Out of Sample\u0026rdquo; ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/efficient-frontier-optimization/","summary":"Covers Markowitz\u0026rsquo;s mean-variance model as the mathematical foundation for setting asset weights, the two optimal points on the efficient frontier (Min-Variance and Tangency), and the practical adjustments that compensate for the model\u0026rsquo;s weaknesses.","title":"Efficient Frontier and Portfolio Optimization"},{"content":"When a monolith grows, microservices come to mind. Builds slow down, a change in one part blocks deployment of another, and traffic to one place propagates through the whole. 
The conventional prescription is \u0026ldquo;make it smaller.\u0026rdquo; The harder problem is deciding where to cut.\nMicroservices is a decision about which criterion to use to decompose the system. Domain boundary, data ownership, scale pattern, failure isolation — whichever you anchor to creates the service boundaries, and those boundaries decide communication, data consistency, and operational cost in turn. Pick the wrong criterion and downstream decisions go off, and the boundary, once drawn, is hard to undo.\nflowchart LR A[Decomposition CriteriaDomain · Data · Scale · Failure] --\u003e B[Service Boundary] B --\u003e C[CommunicationSync / Async] B --\u003e D[Data Consistency] C --\u003e E[Operational Cost] D --\u003e E Decomposition Criteria Splitting a service is a question of which difference becomes the boundary. The same code shaped by domain looks one way, by data ownership another, by scale pattern yet another. There is no single right criterion. The question is which criterion is dominant in this system.\nDomain Boundary DDD\u0026rsquo;s Bounded Context aligns directly. Business-meaning units like \u0026ldquo;Order,\u0026rdquo; \u0026ldquo;Payment,\u0026rdquo; and \u0026ldquo;Recommendation\u0026rdquo; align with the service. Change cohesion is good. When order logic changes, only the order service changes.\nThe weakness shows when the domain is unclear. Draw boundaries before the model has settled and the wrong model becomes the service boundary. One domain ends up split across two services, or two domains fuse into one. From then on, most changes touch both services.\nThis is the vertical-split decision inside a single codebase, extended to the service boundary.\nData Ownership Who owns the write authority over which tables. The moment you allow a shared DB, you lose the core benefit of MSA: independent deployment and independent schema evolution. 
A schema change in one service can break another service\u0026rsquo;s code.\nThis criterion often coincides with the domain boundary. A domain owns its data as a matter of course. But within the same domain, when write patterns differ, data ownership becomes a separate criterion. Within an order domain, transactional order writes and analytics aggregates carry different load and consistency demands.\nScale Pattern Split by workload characteristics. Within the same domain, splits fit when read-vs-write, CPU-vs-IO, or bursty-vs-steady patterns diverge.\nTake a chat workload: message publishing is write-heavy and bursty, while message search is read-heavy and IO-bound. Bundle them into one service and tuning for either pattern leaves the other inefficient. Split them and each scales the way that suits it. Publishing is absorbed by a queue, and search is served by index plus cache.\nFailure Boundary Split so a failure in one service can\u0026rsquo;t reach another. Separate the critical path from the non-critical path.\nTake an ad-serving workload: when the main recommender fails, fallback content has to keep flowing or revenue takes a hit. Put both in one service and the main\u0026rsquo;s failure halts the fallback as well. Split them and the fallback survives on its own path. This is the decomposition criterion from the stability perspective.\nWhen Criteria Conflict The domain boundary often wants one split while the scale pattern wants another. When a single domain holds both bursty and steady workloads, domain unity and scale separation collide.\nPick the dominant criterion for this system, split along it, and handle the remaining criteria through internal modules or queues within a service. Pulling every criterion up to the service boundary explodes service count and operational cost beyond control.\nService Communication Once service boundaries are set, the next decision is how they talk. 
Synchronous or asynchronous.\nSynchronous — gRPC When the call is imperative and needs an immediate response. Payment requests, auth checks. The caller waits for the result and learns of failure right away.\ngRPC defines bidirectional contracts via ProtoBuf, an IDL, over HTTP/2. It supports four modes: unary, server streaming, client streaming, and bidirectional streaming. ProtoBuf\u0026rsquo;s binary serialization is lighter than JSON. HTTP/2\u0026rsquo;s multiplexing resolves the head-of-line blocking that limited HTTP/1.1.\nThe cost of sync is cumulative latency along long call chains, and the fact that one failure propagates through the chain. Sync communication is best kept to short chains and confined to the critical path.\nAsynchronous — Kafka When the system needs event publication, eventual consistency, or traffic absorption. An order is created and the recommendation service consumes the event to update its model; user activity logs accumulate in a queue and an analytics service processes them at its own pace.\nKafka is a distributed log. A producer writes events to a topic, and consumers read from their own offsets. Multiple consumers can read the same event for different purposes (fan-out). Bursts get absorbed by the queue, flattening the load downstream.\nThe cost of async is consistency that is not immediate. A brief gap exists before the event arrives, and lag accumulates if a consumer halts or slows.\nWhich Path The decomposition criterion answers it.\nDomain boundary split, but two domains depend on each other\u0026rsquo;s immediate result → sync. Data ownership split where one service\u0026rsquo;s change must update another\u0026rsquo;s cache or view → async events. Scale pattern split with bursts to absorb → async queue. Failure isolation separating critical and non-critical → sync on the critical path, async on the non-critical path to enable fallback. Most real systems mix both. 
A system that insists on a single communication style is usually reading only one decomposition criterion.\nData Consistency and Operational Tools Loss of the Single Transaction The single ACID transaction familiar from the monolith is gone. \u0026ldquo;Wrap payment and inventory deduction in one transaction\u0026rdquo; is not natural across service boundaries.\nDistributed transactions take two directions. Attempt them synchronously (2PC, TCC), or accept eventual consistency and design compensating transactions (Saga, Outbox). Neither restores the simplicity of the single transaction.\nWhen drawing service boundaries, be aware of where the single transaction breaks. Splitting the most-frequently-co-changing data across boundaries turns every write into a distributed transaction, and the cost is hard to dismiss. The decomposition criterion decides not just communication but the data-consistency model as well.\nWhere Operational Tools Become Necessary As service count grows, new tools become necessary in specific places.\nService Mesh: when communication policy (retry, timeout, circuit breaker, mTLS) needs to live outside application code API Gateway / BFF: when auth, rate limiting, and response composition belong concentrated at the external entry point Distributed tracing: when call chains grow long enough that locating slowness in a request becomes hard Container orchestration: when service count grows enough to require automated deployment and scaling These tools follow as consequences, not as prerequisites. With clear decomposition criteria and simple boundaries, tool adoption can be deferred. Adopt tools first while the criteria are unclear, and complexity accumulates without ever being made visible by the tools.\nThe Cost of Wrong Decomposition A split is hard to reverse. Once code lives in a separate service, it gets its own data, its own deployment pipeline, its own monitoring, its own team dependencies. 
Merging it back means unwinding all of that.\nSystems with the wrong decomposition criterion typically show two signals. When most changes require simultaneous deployment of multiple services, the domain boundary was drawn wrong. When most calls extend into a long sync chain, communication was decided by inertia rather than by the criterion.\nThe same principle applies to split decisions inside a single codebase, but MSA leaves those decisions in a form that is hard to reverse.\nSo when in doubt, I argue for not cutting. Draw module boundaries inside the monolith first and let those boundaries settle, then cut. The right time is when the domain has firmed up, ownership is clear, and scale differences make operations hard. Once cutting becomes the goal, the decomposition criterion turns into post-hoc justification.\nReferences Horizontal vs Vertical Slicing — Horizontal/vertical splits within a single codebase. The MSA domain criterion is the service-boundary version of the same decision. HTTP/1.1 and HTTP/2 — HTTP/2 multiplexing. The transport layer where gRPC\u0026rsquo;s synchronous communication model lives. Kafka Fundamentals and KRaft Mode — Kafka producer/consumer mechanics and KRaft mode. The infrastructure for asynchronous communication. Distributed Transactions — 2PC, TCC, Saga, Outbox. The patterns chosen where the single transaction breaks. ","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/microservices-architecture/","summary":"MSA is a decision about which criterion to use to decompose the system. 
Domain boundary, data ownership, scale pattern, failure isolation — the chosen criterion creates the service boundaries, and those boundaries decide communication and data in turn.","title":"Microservices Architecture"},{"content":"ES (Event Sourcing) and CQRS (Command Query Responsibility Segregation) are decisions about what form to keep the source of truth in, and how to derive views from it.\nES keeps the source of truth as a sequence of changes rather than state. The current state is derived from that sequence. CQRS separates distinct views from the same source of truth. The two patterns are independent but pair into one design. When a new read model is needed, a new projection consumes the same event sequence from the start.\nflowchart LR A[Command] --\u003e B[(Event Storeappend-only)] B --\u003e C[Projection] C --\u003e D[Read ModelSearch] C --\u003e E[Read ModelAnalytics] C --\u003e F[Read ModelUI] Event Sourcing CRUD typically treats one thing as the source of truth: the current value in a row. If the orders table reads PAID, the order is paid. Changes are overwritten by UPDATE, so how it reached that state is lost.\nES replaces that with a sequence of changes. OrderCreated, PaymentRequested, PaymentCompleted, OrderShipped — events accumulate append-only, and that sequence is the source of truth. \u0026ldquo;Is the current order paid?\u0026rdquo; is answered by replaying events from the beginning. State is the result derived from the event sequence.\nThis small shift changes the system\u0026rsquo;s design premise.\nAn audit log follows naturally. Every change is already recorded as an event, so a separate audit infrastructure is unnecessary. Time-travel debugging becomes possible. Replaying to a particular point reproduces the state at that point. Adding a new view becomes free. A new projection over the event sequence yields a new read model. The cost follows. As events accumulate, replaying from scratch every time grows expensive. 
That\u0026rsquo;s why snapshots are introduced — store the state up to a point in time and replay only events after it. Schema changes are awkward as well. An event published years ago is still the source of truth in the system, so its shape cannot be altered casually.\nKafka is often raised as an event store. The append-only log model aligns with ES. But Kafka\u0026rsquo;s retention policy typically deletes data after a window, which conflicts with ES\u0026rsquo;s premise of permanent retention of every event.\nCQRS CRUD handles both write and read with one model. In an order domain, a single Order carries validation, state changes, search, and analytics. The setup is clean in a small system but a single model struggles to serve both demands as it grows. The write side wants business rules and transactional integrity, while the read side wants fast queries and varied views, and a single model rarely serves both well.\nCQRS accepts the asymmetry and separates the write model (Command side) from the read model (Query side). The write model is optimized for domain rules and consistency, the read model for query efficiency and representation. Both handle the same source of truth in different shapes.\nThe read model is not one but many. A search index, an analytics aggregate, a UI projection — each derived from the same source of truth. New screen, new read model.\nRead-model update timing splits three ways.\nSynchronous update — the read model updates in the same transaction as the write. Consistency is immediate but the write transaction bears more weight. Asynchronous update — the write emits an event and the read model updates in a separate flow. Eventual consistency, and the most common pattern. On-demand update — derived at read time. Suits views with low query frequency. Asynchronous updates introduce staleness: for a short window after a write, the read model may return an old value. 
Whether that staleness is acceptable to the business decides whether CQRS is on the table.\nCombining ES and CQRS ES and CQRS are independent but pair well.\nWhen ES treats the source of truth as an event sequence, CQRS\u0026rsquo;s read model takes that sequence as input. As events are appended, projections consume them and update the read model. The event is the read-side material, so the integration cost between the two patterns is small. A new read model means a new projection that replays events from the start.\nThis combination flows into the Saga and Outbox patterns as well. An ES event becomes a Saga trigger directly, and the \u0026ldquo;atomicity between DB write and event publication\u0026rdquo; that Outbox guarantees comes for free in ES — storing the event is storing the business data.\nThe downside: the source of truth (events) and views (read models) live in different stores, so read and write consistency is not immediate. A user may not see the result of their action on screen right away, which directly affects UX decisions.\nAdoption Criteria The decomposition criteria — domain boundary, data ownership, scale pattern, failure isolation — set the starting point for ES/CQRS adoption.\nWhen the domain demands audit/compliance as a core requirement ES fits well. Finance, insurance, healthcare. When data ownership requires multiple read-model shapes for one domain CQRS follows. Search, analytics, and a real-time dashboard each demanding a different view of the same source of truth. When the scale pattern shows different load shapes for read and write CQRS read-model separation fits. A different kind of separation from a read replica — the models themselves differ. When failure isolation disallows a read-side failure from blocking writes asynchronous read-model updates provide that isolation. Not every system needs ES/CQRS. In a system where a single CRUD suffices, adopting ES spreads derivation cost across the system. 
The decomposition criterion sets what ES/CQRS can give, and whether that value justifies the cost is what I weigh before adopting.\nOperational Cost ES brings audit log, time travel, and freedom in adding views, but the cost of receiving those values is spread across the system.\nReplay cost — without snapshots, every replay starts from the beginning. With snapshots, the snapshots themselves carry operational weight. Projection operations — each read model\u0026rsquo;s update flow is managed separately, with reprocessing strategies on failure. Schema change difficulty — past events cannot be altered freely, so versioning, upcasting, and weak/strong schema patterns become separate operational practice. Covered in a follow-up post. Debugging abstraction — tracing consistency between event sequences and projections is harder than reading a single state model. If audit, compliance, time-travel debugging, or freedom in adding views are not core values of the system, plain CRUD is often the right call. When the pattern itself becomes the goal of adoption, the system runs fine on the happy path while complexity accumulates without ever realizing ES\u0026rsquo;s value.\nReferences Microservices Architecture — Decomposition criteria (domain boundary, data ownership, scale, failure) and communication decisions. The premise behind judging the value of ES/CQRS adoption. Distributed Transactions — Single-transaction decomposition and reassembly with Saga and Outbox. The point where ES events become Saga triggers directly. Kafka Fundamentals and KRaft Mode — Kafka mechanics and retention policy. Background for understanding the premise gap with ES\u0026rsquo;s permanent retention. ","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/event-sourcing-and-cqrs/","summary":"ES and CQRS address how a system\u0026rsquo;s source of truth is shaped and how its views are separated from it. 
Adoption cost spreads across the system, so I lean toward adopting only when the value can be stated explicitly.","title":"Event Sourcing and CQRS"},{"content":"The ACID transactions familiar from a monolith are not natural in distributed environments. \u0026ldquo;Wrap payment and inventory deduction in one transaction\u0026rdquo; is a single line inside one DB but loses guarantees the moment it crosses two services. The A — Atomicity — of a single transaction disappears at the service boundary.\nDistributed transactions are about how a single ACID transaction decomposes and how its pieces are reassembled. Two branches: bind the distributed commits in one synchronous round, or let each commit locally and recover from failures through compensation.\nflowchart LR A[Monolith Single TransactionACID] --\u003e B[Decomposed at Service Boundary] B --\u003e C{Reassembly Strategy} C --\u003e|Synchronous consensus| D[2PC] C --\u003e|Local commits + compensation| E[Saga / TCC] E --\u003e F[OutboxDB ↔ Event Consistency] 2PC 2PC (Two-Phase Commit) attempts to commit distributed pieces at once. A coordinator asks every participant whether they are ready (prepare), and if all agree, sends \u0026ldquo;commit\u0026rdquo; (commit phase).\nsequenceDiagram participant C as Coordinator participant P1 as Participant 1 participant P2 as Participant 2 participant P3 as Participant 3 Note over C,P3: Phase 1 — Prepare C-\u003e\u003eP1: prepare C-\u003e\u003eP2: prepare C-\u003e\u003eP3: prepare P1--\u003e\u003eC: vote yes P2--\u003e\u003eC: vote yes P3--\u003e\u003eC: vote yes Note over C,P3: Phase 2 — Commit C-\u003e\u003eP1: commit C-\u003e\u003eP2: commit C-\u003e\u003eP3: commit P1--\u003e\u003eC: ack P2--\u003e\u003eC: ack P3--\u003e\u003eC: ack The mechanism is clean but the cost is steep. After the prepare phase, each participant locks resources until commit or abort arrives — long lock waits. 
If the coordinator fails between phases, participants can wait indefinitely — a single point of failure. And two round trips of latency add up on every transaction.\nFailure modes have a clear structure. If a participant votes no in prepare, or simply does not respond, the coordinator broadcasts abort and the protocol terminates cleanly. Voting yes binds the participant: it records \u0026ldquo;prepared\u0026rdquo; in its WAL and undertakes to commit when told to. The commit phase is light work, and a participant that crashes in the prepared state recovers by reading its WAL and querying the coordinator for the decision.\nTwo cases actually break consistency. If the coordinator dies after sending commit to only some participants, blocking occurs. Or the participant fails to actually commit after voting yes, due to disk failure or hardware fault — that case is outside the protocol\u0026rsquo;s guarantees, a data-corruption scenario requiring manual recovery at the application or operations layer.\n2PC achieves strong consistency, but the trade-off is explicit: availability and performance paid for consistency. It is rarely used in modern MSA, and when it is, only on narrow paths where strong consistency is essential.\nSaga / TCC Saga splits a business transaction into multiple local transactions across services and runs them sequentially. If a step fails, compensating transactions roll back the previous steps.\nTake an order flow: create order → charge payment → reserve inventory → register shipment. If inventory reservation fails, payment is refunded (compensation), the order is canceled (compensation). The single-transaction rollback becomes explicit compensation logic spread across services.\nSaga embraces eventual consistency. Brief gaps exist between steps, and during those gaps the system is in a temporarily inconsistent state. 
Whether that inconsistency is acceptable to the business is the core condition for adopting Saga.\nCompensation failure is the hard part of Saga. Once a forward step has committed, its compensation must complete — Saga has no \u0026ldquo;give up\u0026rdquo; state. Compensations must be idempotent and retryable. If indefinite retry is not enough, recovery either pushes forward to another valid endpoint (forward recovery — credit a wallet when a card refund fails), or escalates to manual operations.\nChoreography Each service publishes an event, and other services subscribe to advance their step. There is no central coordinator. When \u0026ldquo;OrderCreated\u0026rdquo; is published, the payment service consumes it, charges, and publishes \u0026ldquo;PaymentCompleted.\u0026rdquo; The inventory service consumes that, reserves stock, and publishes \u0026ldquo;InventoryReserved.\u0026rdquo; Each service knows its own trigger and compensation logic.\nsequenceDiagram participant O as Order participant P as Payment participant I as Inventory participant S as Shipment O-\u003e\u003eP: OrderCreated P-\u003e\u003eI: PaymentCompleted I-\u003e\u003eS: InventoryReserved Note over O,S: On failure — compensation events in reverse S--\u003e\u003eI: ShipmentFailed I--\u003e\u003eP: InventoryReleased P--\u003e\u003eO: PaymentRefunded Orchestration A central coordinator — a Saga state machine or orchestrator — controls the flow explicitly. It sends a command to the payment service, receives the response, then commands the inventory service. 
On failure, it issues compensation commands in reverse order.\nsequenceDiagram participant Or as Orchestrator participant P as Payment participant I as Inventory participant S as Shipment Or-\u003e\u003eP: charge P--\u003e\u003eOr: ok Or-\u003e\u003eI: reserve I--\u003e\u003eOr: ok Or-\u003e\u003eS: register shipment S--\u003e\u003eOr: failed Note over Or,S: On failure response — compensation commands in reverse Or-\u003e\u003eI: release inventory Or-\u003e\u003eP: refund payment Trade-offs Between the Two Coupling. Choreography has no direct service-to-service dependencies, but the event sequence is distributed across services. Orchestration centralizes flow knowledge in the coordinator while services remain unaware of each other. Visibility. To see how far a transaction has progressed: in Choreography, you trace logs across multiple services; in Orchestration, the coordinator\u0026rsquo;s state is enough. Debugging. Following compensation flow on failure is far more direct in Orchestration. Choreography becomes hard to trace as the event graph grows. Cohesion of business logic. Concentrating the flow of one business transaction in one place is the strength of Orchestration. Choreography splits the flow across services. In smaller systems Choreography is a light starting point, and migrating to Orchestration as flows grow complex tends to fit well. The orchestrator is another service, though, so its added complexity has to be acknowledged.\nTCC TCC (Try-Confirm-Cancel) adapts the same compensation principle into a business-level reservation. 
The Try phase reserves resources; if all Try calls succeed, Confirm runs; if any fails, Cancel — the compensation — is invoked.\nsequenceDiagram participant C as Coordinator participant P1 as Participant 1 participant P2 as Participant 2 participant P3 as Participant 3 Note over C,P3: Phase 1 — Try C-\u003e\u003eP1: try C-\u003e\u003eP2: try C-\u003e\u003eP3: try P1--\u003e\u003eC: reserved P2--\u003e\u003eC: reserved P3--\u003e\u003eC: reserved Note over C,P3: Phase 2 — Confirm C-\u003e\u003eP1: confirm C-\u003e\u003eP2: confirm C-\u003e\u003eP3: confirm P1--\u003e\u003eC: ack P2--\u003e\u003eC: ack P3--\u003e\u003eC: ack The difference from Saga is lock duration. Saga commits each step locally and immediately, so other transactions freely see the intermediate state. TCC keeps a reservation in place — a \u0026ldquo;reserved seat\u0026rdquo; or a \u0026ldquo;balance held for processing\u0026rdquo; — that semantically locks part of the resource during the Try-to-Confirm window. Not a real DB lock, but a brief semantic one.\nThe cost is explicit: every participating service must expose a consistent reserve/confirm/cancel trio of APIs. It gives up some of Saga\u0026rsquo;s simplicity to shorten the lock window.\nOutbox Saga and other event-driven patterns share one core limitation: DB write and event publication are not in the same transaction.\nSuppose state is saved to the DB and that fact is then published as an event. The two operations are separate. If the service fails after the DB write but before publishing, the event is lost. The reverse — publishing first, then writing — leaves an event announcing something that did not happen. Either order can break consistency.\nThe Outbox Pattern solves this by binding the two operations inside the same DB transaction. Alongside the business data write, the event to publish is recorded in a separate outbox table within the same transaction. When the transaction commits, the event is safely persisted in outbox. 
A separate publisher process then polls outbox or detects changes via CDC (Change Data Capture) and publishes to a message broker (commonly Kafka).\nThis setup naturally implies at-least-once delivery. If the publisher fails after publishing but before deleting the outbox row, the same event can be published again. Consumers must therefore be idempotent, designed so that processing the same event twice yields the same result.\nOutbox guarantees that every committed DB change eventually reaches the broker, at least once.\nPattern Selection Criteria The decomposition criteria — domain boundary, data ownership, scale pattern, failure boundary — connect to distributed transaction pattern selection.\nWhen the domain split forces strong consistency between two domains: that is a sign the boundary was drawn wrong. If strong consistency is genuinely required, merging into one domain is preferable; if separation is unavoidable, restrict 2PC to a narrow path. When data ownership requires one service\u0026rsquo;s change to update another\u0026rsquo;s view or cache: Saga (Choreography) plus Outbox fits well. Events propagate the update; Outbox guarantees that each committed change is published. When the scale pattern demands burst absorption: Saga (Choreography) fits, because it relies on a queue that flattens load. When critical and non-critical paths are separated by failure boundary: non-critical paths accept eventual consistency via Saga; critical paths use strong consistency, or are redesigned into a boundary that allows a single transaction. When multiple transactions contend for the same resource and over-allocation is not acceptable: TCC\u0026rsquo;s reservation fits. Its short semantic lock, which Saga lacks, ensures correctness without going as far as 2PC\u0026rsquo;s DB lock, which makes it a middle point between the two. 
Treating pattern choice as a natural consequence of boundary decisions rather than a technology preference turns the question \u0026ldquo;should we use Saga here\u0026rdquo; into \u0026ldquo;does the decomposition criterion demand this pattern.\u0026rdquo;\nThe Cost of Simplicity Distributed transactions are not the design of the happy path; they are the design of failure recovery. The happy path looks simple under any pattern. Differences surface in failure cases — partial failures, lost messages, duplicate processing, data inconsistency. Evaluating a pattern by mentally running its failure paths, not its success paths, is the honest measure.\nIn a monolith, a single ACID transaction provided that recovery by default. In a distributed environment, the same simplicity becomes the result of explicit cost paid. Compensation logic, idempotency design, outbox infrastructure, orchestrator operations, debugging tools. Reach for a pattern without acknowledging that cost and the system runs fine on the happy path until the first partial failure breaks consistency.\nSo pattern choice is another name for failure recovery design. Sketch what failures can occur and what state the system should recover to, then choose the pattern that fits the sketch.\nReferences Microservices Architecture — Decomposition criteria (domain boundary, data ownership, scale, failure) and inter-service communication. The premise behind distributed transaction pattern selection. Kafka Fundamentals and KRaft Mode — Kafka producer/consumer mechanics and partition/offset semantics. Background for where Outbox sits between DB and broker for consistency. ","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/distributed-transactions/","summary":"Distributed transactions are about how a single ACID transaction decomposes across services and how its pieces are reassembled. 
The roles and trade-offs of 2PC, Saga (Choreography vs Orchestration), and Outbox.","title":"Distributed Transactions"},{"content":"The first time you open Claude Code to customize it, the surface feels scattered. settings.json, CLAUDE.md, slash commands, subagents, hooks, plugins — the same intent has too many possible homes, and the question \u0026ldquo;where does this go?\u0026rdquo; blocks you before anything else. Dump everything into CLAUDE.md and it bloats; split it and you lose track of which file fires when.\nPick one axis and the problem almost dissolves. When does it step in? This post walks through Claude Code\u0026rsquo;s customization surface rearranged along that axis — into four layers.\nLayer Structure Layer When it steps in Responsibility CLAUDE.md + Rules Always (loaded into every turn) Tacit conventions and guardrails Agents When a Skill or model delegates to it Context-isolated specialist roles Skills When I call it Reusable workflow recipes Hooks Automatically, before/after tool use Validation, automation, safety rails That table is arguably the whole post. The rest is how each layer settles onto this axis, and how all four interlock inside a single workflow.\nCLAUDE.md + Rules The knowledge Claude loads into context every turn. It applies even when nothing is called. Two tiers.\nCLAUDE.md is the top-level context at the project/user level. It can live at ./CLAUDE.md, .claude/CLAUDE.md, or ~/.claude/CLAUDE.md — when multiple exist, they merge in hierarchy order. My CLAUDE.md holds language-independent behavioral rules.\n# CLAUDE.md (excerpt) - On rejected approach: stop immediately and ask for direction - Change scope: only change what was explicitly requested - No commits: never commit until explicitly asked - Propose approach first: for changes touching 3+ files or affecting architecture, propose the approach before writing code rules/*.md is the lower tier, split per language and domain. 
Drop .md files into ~/.claude/rules/ or .claude/rules/ and they\u0026rsquo;re discovered recursively. A paths frontmatter field enables scoped rules that only apply to matching file patterns.\n# rules/go.md (excerpt) - No Get prefix: GetName() ❌ → Name() ✅ - Error wrapping: bare return err forbidden. fmt.Errorf(\u0026#34;context: %w\u0026#34;, err) - No panic in libraries: only in main/tests # rules/typescript.md (excerpt) - Destructuring-first: function params ({ server, db }: Config) - Braces required: if (x) return; ❌ → if (x) { return; } ✅ - Data shapes → type, implementation contracts → interface - Use enum: prefer enum over as const objects # rules/code-principles.md (excerpt, language-agnostic) - fail-fast: raise/throw immediately on validation failure - Guard clauses: early return instead of nested if - Immutability-first: limit mutations to explicit scopes - Pure functions preferred: push side effects to call boundaries - No Any types: use generics, unions, concrete types Splitting language rules into rules/ instead of putting them in CLAUDE.md keeps CLAUDE.md from swelling.\nThe test: \u0026ldquo;does this need to be in effect regardless of whether anything is called?\u0026rdquo; If yes, it belongs in this layer.\nAgents Agents are invocation units, but they don\u0026rsquo;t run directly in the main session. They\u0026rsquo;re defined at ~/.claude/agents/\u0026lt;name\u0026gt;.md and are delegated to by a Skill (covered next) or by the model itself. The key word is context isolation. An Agent opens its own context window, drills into a single responsibility, and hands back only the result.\nTake the architect agent — it handles architecture analysis and root-cause debugging. When some skill dispatches architect, the agent reviews the design in a separate context and returns a conclusion. Everything the agent processed stays out of the main session. 
Or verify-agent: it runs build → typecheck → lint → test as an isolated pipeline and reports only pass or fail. refactor-cleaner removes dead code and unused imports in isolation. code-reviewer, security-reviewer, and database-reviewer each inspect implementation output through a different lens.\nThe instinct for separating Skills and Agents is this. A Skill orchestrates; an Agent is the deep execution of one responsibility. Breaking work into stages and wiring the right tools into each stage is Skill work. When one of those stages needs to run independently without polluting the main context, that slot is filled by an Agent. The diagrams in the next section show exactly how these agents get dispatched.\nThe test: \u0026ldquo;does this need to be separated from the main context?\u0026rdquo; If not, a Skill is enough.\nSkills Skills are workflow recipes I call explicitly. Each one lives at ~/.claude/skills/\u0026lt;name\u0026gt;/SKILL.md and is triggered with /skill-name. Prompt, allowed tools, and model assignment are all bundled into a single file. Instead of retyping the same instructions every time, a recipe steps in: \u0026ldquo;for this situation, use this skill.\u0026rdquo;\n/code The natural home for Skills is repetitive development work. The heaviest skill in my setup is /code. It inspects the input and auto-detects two paths: pass it a text description and it enters design (Brainstorming); pass it a .claude/plans/ directory and it enters implementation (Pipeline).\nWhen given a text description, the Brainstorming path runs. It explores requirements, decides whether to split the work, and writes a design document (DESIGN.md), per-sub-task plan files (NN-\u0026lt;task\u0026gt;.md), and a dependency graph (_dag.yaml) under .claude/plans/\u0026lt;topic\u0026gt;/. It runs the architect agent for a design review and stops. 
The output becomes the Pipeline path\u0026rsquo;s input.\nflowchart TD I[\"Idea input/code (Brainstorming)\"] --\u003e C[\"Context collectionproject type · CLAUDE.md · git log\"] C --\u003e R[\"Requirements explorationAskUserQuestion 1:1\"] R --\u003e A[\"2-3 approaches + recommendation\"] A --\u003e PM[\"pm-code-agentsplit decision\"] PM --\u003e|SINGLE| D1[\"DESIGN.md + _dag.yaml01-main.md\"] PM --\u003e|SPLIT| D2[\"DESIGN.md + _dag.yamlNN-task.md × N\"] D1 --\u003e AR[\"architect agent review\"] D2 --\u003e AR AR --\u003e|NEEDS REVISION| D2 AR --\u003e|APPROVED| S[\"statusdraft → ready\"] S --\u003e O[(\".claude/plans/\u0026lt;topic\u0026gt;/\")] Passing a .claude/plans/\u0026lt;topic\u0026gt;/ directory switches to the Pipeline path. The interesting part is that the details are kept out of SKILL.md and live in references/stage-*.md, pulled in with Read only at the moment they\u0026rsquo;re needed. Normal turns only carry the orchestrator in context, and each stage document loads only when that stage begins.\nSub-tasks from _dag.yaml are topologically sorted, and each one passes through five stages.\nflowchart TD I[(\"/code input.claude/plans/\u0026lt;topic\u0026gt;/\")] --\u003e L[\"Load _dag.yamltopo-sort + status gate\"] L --\u003e PRE[\"Stage Prearchitect agent+ planner agent\"] PRE --\u003e IMP[\"Stage Implparallel build(agent team)\"] IMP --\u003e POST[\"Stage Post (parallel)code-reviewersecurity-reviewerdatabase-reviewerverify-agent\"] POST --\u003e|FAIL| FIX[\"Stage Fixverify-agentauto-repair\"] FIX --\u003e POST POST --\u003e|PASS| CLEAN[\"Stage Cleanrefactor-cleaneragent\"] CLEAN --\u003e DONE[\"statusready → done\"] DONE --\u003e N{\"Next sub-task?\"} N --\u003e|yes| PRE N --\u003e|no| R[\"Final report\"] Pre (stage-pre.md) — calls architect and planner in order to do structural analysis and produce the execution plan. The result is appended to the sub-task document as ## Plan. Impl (stage-impl.md) — runs parallel implementation based on the plan. 
Simple work is handled directly by the leader; complex work spawns a team of agent members. Post (stage-post.md) — calls code-reviewer, security-reviewer, database-reviewer, and verify-agent in parallel for comprehensive review. PASS / NEEDS ATTENTION / FAIL is decided here. Fix (stage-fix.md, conditional) — if Post returns FAIL, verify-agent runs to auto-fix fixable errors, then Post is re-run. Bound by retry-policy.md: max 3 attempts by default, with \u0026ldquo;same error twice in a row → stall detection.\u0026rdquo; Clean (stage-clean.md) — refactor-cleaner removes dead code, unused imports, and duplication. Failure here is treated as non-critical — a warning is logged and the pipeline continues. If cleanup happens to break the build, Post catches it on the next run. When a sub-task clears all five stages, its NN-\u0026lt;task\u0026gt;.md frontmatter gets promoted from status: ready to status: done. That transition is the re-run guard — running /code again against the same plan skips any task already marked done and only picks up the rest.\nThis skill shows the basic pattern of the Skill layer. Auto-detecting paths based on input, passing stateful files between paths, and offloading details from SKILL.md to references to keep context lean.\n/github-ship — Branch to Merge in One Pipeline Once implementation is done, the code needs to land. /github-ship bundles everything from branch creation to merge into a single pipeline. It\u0026rsquo;s the consolidation of what used to be three separate skills: /git-branch, /git-commit, and /github-pr-push.\nIt runs five phases.\nBranch — analyzes the changes by concern, decides whether to split into multiple PRs, and creates a convention-aligned branch. Commit — reviews the staged diff, splits by concern, and writes convention-conformant messages. Push \u0026amp; PR — runs static analysis (lint/typecheck), pushes, and creates a PR via gh pr create. 
Review — scales review intensity by change size (TRIVIAL/SMALL/MEDIUM/LARGE), fires review agents in parallel. If issues are found: fix → push → re-review loop. Merge — once all review issues are resolved, offers squash or merge commit, then merges. When /code finishes with all sub-tasks passing, it automatically asks whether to run /github-ship. Approve, and implementation flows seamlessly into PR merge.\nTwo PR-related skills remain standalone.\n/github-pr-review \u0026lt;PR number\u0026gt; — deep-reviews an existing PR. Uses the same agents as github-ship Phase 4, but callable independently. /github-pr-respond — walks through review comments on a PR, confirming whether to address each, and posts replies. CLI over MCP There are two main ways to wire tools into Claude Code: MCP servers and CLI invocation. In the git/GitHub space both options exist. Yet all the skills above call CLIs like gh and git via Bash(...:*) in their allowed-tools. The reason is context savings.\nThe moment an MCP server connects, its entire tool catalog becomes resident in context. A few hundred tokens per tool, and a single server exposing around twenty tools consumes thousands of tokens \u0026ldquo;doing nothing.\u0026rdquo; Attach three such servers and 4,000+ tokens are gone before you type a single character (Scott Spence). CLI tools only spend tokens when invoked. In real comparisons, CLI reduced token usage by roughly 68% against MCP on the same workload (BSWEN — MCP vs CLI), and monthly operating costs came out 4 to 32× apart in another sample (BSWEN — Token usage).\nAnthropic is aware of this cost and has introduced lazy-loading optimizations like Tool Search, which reportedly cut overall agent tokens by 46.9% when MCP is in use (Joe Njenga, Medium). Even so, in domains where mature CLIs already exist — git and GitHub are the obvious ones — the skill + Bash combo is still the lightest option. 
That\u0026rsquo;s why github-ship and the other git/GitHub skills in my setup don\u0026rsquo;t touch MCP and route everything through the CLI.\nThe test: the inner axis of the Skill layer is \u0026ldquo;do I call it, or can the model call it on its own?\u0026rdquo; The disable-model-invocation flag draws that line — write operations that are risky or hard to reverse stay locked, while anything that needs to auto-trigger for everyday productivity stays open.\nHooks Hooks are never invoked. They react to tool events and run automatically. Register them in settings.json under hooks as PreToolUse / PostToolUse, and shell scripts step in before or after specific tool calls.\nHooks cover two slots.\nSafety rails — block dangerous commands before they execute. remote-command-guard.sh hooks in before a Bash call (PreToolUse), checks categories like rm -rf, curl | sh, and reads of /etc/passwd, and blocks with exit 2 if anything matches. Automation — running formatters after an edit (format-file.sh), nudging for security review when a sensitive file is touched (security-auto-trigger.sh), masking secrets in every tool\u0026rsquo;s output (output-secret-filter.sh). The \u0026ldquo;things you don\u0026rsquo;t want to do by hand but must do every time\u0026rdquo; slot. Permission allow/deny pairs with Hooks. The permissions.deny list in settings.json is a static filter. Declare a pattern like Bash(*rm -rf*) and matching commands never reach the tool call at all. Hooks layer dynamic checks on top. The context-dependent risks that static filters miss (specific redirect targets, conditional combinations) get judged by the script. 
Static declarations plus dynamic inspection: the two tiers together form one layer that plays the role of safety rail.\nThe test: \u0026ldquo;does this need to intervene automatically on a tool event?\u0026rdquo; The no-invocation requirement is what separates Hooks from Skills and Agents.\nIntegrated Workflow Taken one at a time, each layer\u0026rsquo;s responsibility stays sharp. But in practice all four run inside the same workflow. Here\u0026rsquo;s one flow as an example.\nI type /code \u0026quot;design a new auth module\u0026quot; → a Skill starts moving on the Brainstorming path. The Skill dispatches architect and planner internally → Agents perform design analysis and step decomposition inside isolated contexts. Throughout all of this, every turn has language-specific rules and code principles loaded into context → Rules are quietly in effect. Once design is done, /code asks whether to switch to Pipeline. Approve, and it starts writing files → Edit / Write fires over and over. Each time, Hooks react. format-file.sh runs the formatter, code-quality-reminder.sh nudges error handling and immutability checks, and for security-related files security-auto-trigger.sh requests review. At the end of the pipeline, /code calls verify-agent again → build, typecheck, lint, and tests run in isolation and only the result comes back. If everything passes, /code asks whether to run /github-ship → approve, and another Skill takes over: branch → commit → push → review → merge in one shot. All four layers participate in a single task: what I called (Skills), what the Skill delegated (Agents), what was always in place (Rules), and what reacted to an event (Hooks). None of their responsibilities overlap.\nSummary When deciding where a new piece of configuration goes, four questions are enough.\nMust it always be in effect? → CLAUDE.md + Rules Must it react automatically to a tool event? → Hooks Is it a workflow that runs when invoked? 
→ Skills Must it run in a separate context? → Agents If more than one applies, split the responsibility. Full config reference: .dotfiles/claude/.\n","permalink":"https://wid-blog.github.io/en/posts/tech/devenv/claude-code-config-layers/","summary":"settings.json, CLAUDE.md, slash commands, subagents, hooks. Claude Code\u0026rsquo;s customization surface settles into four layers once you pick one axis: when does each one step in?","title":"Claude Code Config in Four Layers"},{"content":"Working seriously with Claude Code pulled my CLI-based editor habit way up. The flow of \u0026ldquo;open a terminal from a GUI editor\u0026rdquo; flipped into \u0026ldquo;open everything from inside a terminal.\u0026rdquo;\nThis post covers the dotfiles that produce the screen above. It walks through why each tool was chosen and how they fit together.\nThe deeper Claude Code configuration — skills, agents, hooks, MCP — will be covered in separate posts, so I only touch it lightly here.\nStack The terminal emulator is alacritty. tmux splits the screen into three inside an alacritty window. zsh runs as the shell in each pane. The editor in the top-left pane is nvim on a LazyVim base, and Claude Code runs as an AI pair in the right 30% pane.\nTool Role alacritty terminal emulator tmux session/window/pane multiplexer zsh shell nvim (LazyVim) editor (top-left pane) Claude Code AI pair (right 30% pane) This post walks through them in that order — alacritty, tmux, zsh, nvim, Claude Code — covering what each does and why it was chosen.\nalacritty alacritty is a cross-platform, GPU-accelerated terminal emulator written in Rust. It keeps its own feature set minimal, delegating splits and session management to other tools.\nI picked alacritty as the terminal emulator. Three reasons.\nGPU rendering — OpenGL-based rendering, so input latency stays low. Config-as-code — every setting lives in a single alacritty.toml. No hunting around for where things were saved. 
Simplicity — alacritty intentionally omits tabs, splits, and sessions. That space is for tmux to fill. The last point is the key one. By delegating splits and sessions to tmux instead of letting alacritty own them, the same abstraction works identically on macOS and Linux. The simpler the layer below, the more portable the layer above — that\u0026rsquo;s how I saw it.\nThe default shipped config is nearly empty. Colors, fonts, window decorations — until the user fills them in, alacritty is as close as it gets to a \u0026ldquo;raw terminal.\u0026rdquo; That empty state is where config-as-code begins.\nMy customizations are simple. Window decorations turned off to hide the macOS title bar. Padding removed so no pixel is wasted. The font set to a nerd font variant so nvim\u0026rsquo;s devicons render. Colors set to catppuccin mocha, written directly into the toml. No external yml, no includes — one file, done.\nThe keybindings translate Cmd key combinations into ESC sequences.\n[keyboard] bindings = [ { chars = \u0026#34;\\u001Bh\u0026#34;, key = \u0026#34;H\u0026#34;, mods = \u0026#34;Command\u0026#34; }, { chars = \u0026#34;\\u001Bl\u0026#34;, key = \u0026#34;L\u0026#34;, mods = \u0026#34;Command\u0026#34; }, { chars = \u0026#34;\\u001Bw\u0026#34;, key = \u0026#34;W\u0026#34;, mods = \u0026#34;Command\u0026#34; }, ] There is one macOS-specific issue. Cmd+H gets intercepted at the OS menu level as \u0026ldquo;Hide Application.\u0026rdquo; That key is supposed to be translated by alacritty\u0026rsquo;s keybindings into ESC+h (vim\u0026rsquo;s M-h) and forwarded to nvim, but if AppKit consumes it first, the translation never happens. That is why setup-macos.sh adds one extra line at the end.\ndefaults write org.alacritty NSUserKeyEquivalents -dict-add \u0026#34;Hide Alacritty\u0026#34; \u0026#34;\u0026#34; That single line is what makes alacritty\u0026rsquo;s nvim integration work on macOS. 
The config file cannot solve this problem, so it requires an OS-level defaults write command.\ntmux tmux is a terminal multiplexer. It manages multiple sessions, windows, and panes inside a single terminal, and keeps processes alive even after detaching.\nSplitting the screen on top of alacritty is tmux\u0026rsquo;s job. alacritty has no tabs, so tmux\u0026rsquo;s sessions / windows / panes fill that role instead.\nThe config stays close to defaults. The prefix stays at C-b (C-a collides with readline\u0026rsquo;s line-start and interferes in the shell). Copy mode runs on vim keys, and the ESC delay between nvim and tmux is removed. That last setting is small but nvim users notice the difference immediately.\ntmux waits briefly after receiving ESC to decide whether it starts a function-key or meta-key sequence. set -gs escape-time 0 removes that wait, and mode switches in nvim happen instantly.\nTwo keybinding decisions are directly relevant. The shortcut that opens a Claude Code pane on the right, and the shortcut that opens a small shell pane at the bottom. The nc function described below is a higher-level tool that combines these splits into a single function call.\nbind i split-window -fh -p 30 -c \u0026#34;#{pane_current_path}\u0026#34; \u0026#34;claude\u0026#34; bind o split-window -v -l 15% -c \u0026#34;#{pane_current_path}\u0026#34; prefix i creates a Claude Code pane on the right; prefix o creates a shell pane at the bottom. These are the manual equivalents of what nc automates.\nAnother important decision is unifying pane navigation between vim and tmux behind the same keys. Press C-h/j/k/l without the prefix: if nvim is focused, the key moves between nvim splits; if a shell pane is focused, tmux moves to the neighboring pane. The tmux side checks whether the current pane is running a vim-family process and branches automatically. The boundary between tools disappears. 
No prefix key is needed to move between three panes.\nis_vim=\u0026#34;ps -o state= -o comm= -t \u0026#39;#{pane_tty}\u0026#39; \\ | grep -iqE \u0026#39;^[^TXZ ]+ +(\\\\S+\\\\/)?g?(view|n?vim?x?)(diff)?$\u0026#39;\u0026#34; bind-key -n \u0026#39;C-h\u0026#39; if-shell \u0026#34;$is_vim\u0026#34; \u0026#39;send-keys C-h\u0026#39; \u0026#39;select-pane -L\u0026#39; bind-key -n \u0026#39;C-j\u0026#39; if-shell \u0026#34;$is_vim\u0026#34; \u0026#39;send-keys C-j\u0026#39; \u0026#39;select-pane -D\u0026#39; bind-key -n \u0026#39;C-k\u0026#39; if-shell \u0026#34;$is_vim\u0026#34; \u0026#39;send-keys C-k\u0026#39; \u0026#39;select-pane -U\u0026#39; bind-key -n \u0026#39;C-l\u0026#39; if-shell \u0026#34;$is_vim\u0026#34; \u0026#39;send-keys C-l\u0026#39; \u0026#39;select-pane -R\u0026#39; is_vim inspects the pane\u0026rsquo;s process. If a vim-family editor is running, the keystroke goes to nvim; otherwise tmux handles the pane switch.\nzsh zsh is a Bash-compatible shell with strong completion, extended globbing, and a plugin ecosystem. It has been the default shell on macOS since Catalina.\nzsh has two halves. A .zshrc that handles PATH and environment, and an aliases.zsh that holds functions and aliases. ZDOTDIR points at ~/.config/zsh, and .zshrc sources every *.zsh in that directory.\nZDOTDIR=$HOME/.config/zsh for _zsh_conf in $ZDOTDIR/*.zsh(N); do source \u0026#34;$_zsh_conf\u0026#34; done Thanks to this pattern, I can split aliases / functions / plugin configuration into separate files. New function? Drop a new .zsh file. .zshrc itself almost never gets touched.\nThe nc function is defined as follows.\nfunction nc() { if [[ -z \u0026#34;$TMUX\u0026#34; ]]; then echo \u0026#34;Not inside a tmux session. 
Run from within tmux.\u0026#34; return 1 fi local target=\u0026#34;${1:-$PWD}\u0026#34; local dir if [[ -d \u0026#34;$target\u0026#34; ]]; then dir=\u0026#34;$(realpath \u0026#34;$target\u0026#34;)\u0026#34; else dir=\u0026#34;$(realpath \u0026#34;$(dirname \u0026#34;$target\u0026#34;)\u0026#34;)\u0026#34; fi local nvim_pane nvim_pane=\u0026#34;$(tmux display-message -p \u0026#39;#{pane_id}\u0026#39;)\u0026#34; tmux split-window -h -c \u0026#34;$dir\u0026#34; -l 30% \u0026#34;claude; exec $SHELL\u0026#34; tmux select-pane -L tmux split-window -v -c \u0026#34;$dir\u0026#34; -l 15% tmux select-pane -t \u0026#34;$nvim_pane\u0026#34; nvim \u0026#34;$@\u0026#34; } Line by line.\nThe first guard checks we\u0026rsquo;re inside tmux. Calling nc outside tmux is meaningless since the splits have nowhere to go. Print one line, return 1.\nNext is deciding the working directory. No argument means $PWD. If the target is a directory, use it as-is; if it\u0026rsquo;s a file, use its parent. realpath makes it absolute. This dir becomes the cwd for the two new panes; the nvim pane keeps the shell\u0026rsquo;s current directory. Open a file in nvim and run git status in the side pane and you\u0026rsquo;ll see the same repo.\nThen, save the current pane\u0026rsquo;s ID. Focus needs to come back to nvim after the splits, but focus and pane indexes shift during the splits, so we grab the stable pane ID (#{pane_id}) ahead of time.\nNow the actual two splits.\nFirst — tmux split-window -h -c \u0026quot;$dir\u0026quot; -l 30% \u0026quot;claude; exec $SHELL\u0026quot; creates a 30%-wide pane on the right and runs claude in it. The exec $SHELL switches the pane to a shell after claude exits, so it does not close immediately.\nSecond — tmux select-pane -L jumps back to the left (the original nvim slot), and tmux split-window -v -c \u0026quot;$dir\u0026quot; -l 15% cuts that left side top and bottom. The bottom 15% becomes a small terminal pane.\nFinally, launch nvim. Move focus back to the saved nvim_pane (now the top-left) and run nvim \u0026quot;$@\u0026quot;. 
If the argument is a file, that file opens; if it\u0026rsquo;s a directory, LazyVim\u0026rsquo;s dashboard appears.\nThe result is this layout, produced by a single nc call.\n┌───────────────────────────┬──────────────┐ │ │ │ │ nvim │ claude │ │ (LazyVim) │ (30%) │ │ │ │ ├───────────────────────────┤ │ │ shell (15%) │ │ └───────────────────────────┴──────────────┘ A few helper aliases wrap this function.\nalias zrc=\u0026#34;nc ~/.config/zsh/\u0026#34; alias nvimrc=\u0026#34;nc ~/.config/nvim/\u0026#34; alias alc=\u0026#34;nc ~/.config/alacritty/\u0026#34; alias tlc=\u0026#34;nc ~/.tmux.conf\u0026#34; So one zrc opens the zsh config directory in nvim with Claude Code already sitting in the side pane. Edit a config file, get a review in the next pane over.\nOther aliases include f (pick a file via fzf and open in nvim), g (lazygit), ?? (fabric-ai), ? (w3m search). nc is the center of this post because one function defines where five tools are placed, all at once.\nnvim Neovim is a refactored fork of Vim, adding asynchronous plugins, a built-in LSP client, and Lua-based configuration. LazyVim is a configuration framework on top, providing sensible defaults and a modular extras system.\nThe editor is nvim on a LazyVim base. Instead of writing init.lua from scratch, I chose to inherit LazyVim\u0026rsquo;s reasonable defaults. LSP, treesitter, finder, mason-based LSP installation — all wired up out of the box; building the same thing by hand takes days. And because language extras toggle line-by-line inside lazyvim.json, adding a new language is one line plus :Lazy sync, which installs LSP / treesitter / formatter in one shot. Currently 32 extras are active, covering 14 languages alongside coding, editor, formatting, and test tooling.\nCustomization splits into two places. lua/config/* is where I override LazyVim\u0026rsquo;s defaults (keymap, option, autocmd overrides); lua/plugins/* is where new plugins or extra options for LazyVim extras land. 
Per-language settings are bundled into a single plugins/language/\u0026lt;lang\u0026gt;.lua. Removing go support means deleting that one file.\nlua/plugins/language/ ├── go.lua ├── html.lua ├── java.lua ├── markdown.lua └── typescript.lua I won\u0026rsquo;t go deeper into nvim itself. Keymaps, LSP configuration, debugger integration, snacks.nvim picker, harpoon2 workflow are each a separate post. This post\u0026rsquo;s scope ends at \u0026ldquo;the pattern of building modules on a LazyVim base.\u0026rdquo;\nClaude Code The right 30% pane belongs to Claude Code. It installs through one brew cask line (cask \u0026quot;claude-code\u0026quot;), and nc creates its position. Its role on the screen is simple: while you edit on the left, it pairs with you on the right with the same directory in context.\nThe dotfiles claude/ module actually carries more. settings, agents, hooks, rules, skills all get stowed under ~/.claude/, and they tune Claude Code\u0026rsquo;s behavior in fine grain. agents handle task delegation, hooks handle automation at file-save time, skills hold reusable workflows, rules hold per-language and shared code conventions.\nBut this post\u0026rsquo;s slot ends at \u0026ldquo;five tools in one screen.\u0026rdquo; Each Claude Code component — settings, agents, hooks, skills, MCP, output styles — will be covered in separate posts.\nThe one thing to take away from this post is simple. Claude Code runs in the right 30% pane that nc creates. Nothing more, nothing less. The layout starts the tool; the tool operates within the layout.\nLimits and Trade-offs A few cases where this setup doesn\u0026rsquo;t fit.\nWork that depends on a GUI debugger. If browser devtools or a heavyweight IDE\u0026rsquo;s visual debugger is your daily tool, a terminal-centric layout will keep pulling you between two worlds. This setup rests on the assumption that code editing + shell + AI pair cover 99% of the work.\nPair programming over screen share. 
When you show your screen to a colleague, the intent behind nvim keybindings often doesn\u0026rsquo;t read. Seeing dd delete a line can be confusing for viewers unfamiliar with vim. If pair programming is frequent, a GUI editor has a lower communication cost.\nDifferences from the Linux setup. The same dotfiles work on Linux too, but the parts this post doesn\u0026rsquo;t cover — Hyprland window manager, Kime input method, Linux-specific packages — are separate. On macOS, alacritty handles the terminal emulator role; on Linux, Hyprland takes part of it.\nThe missing pieces, called out honestly. The input method (kime), the window manager (hypr), the keymapper (karabiner) — they fall outside this post\u0026rsquo;s scope, and I left them out. Input method choice does ultimately matter for Korean-speaking developers, but I judged that to be a separate post.\nClosing Starting from where we began: one nc call, a screen split into three, five tools each handling their role. The dotfiles are a set of decisions about what role alacritty / tmux / zsh / nvim / Claude Code each play.\nFollow-up posts will cover Claude Code\u0026rsquo;s internals (settings, agents, hooks, skills, MCP) one at a time. If this post covered the layout itself, the next ones cover the decisions within each tool.\n","permalink":"https://wid-blog.github.io/en/posts/tech/devenv/macos-dev-environment/","summary":"alacritty + tmux + nvim + zsh + Claude Code in a single screen. The choices and structure behind a terminal-centric development environment.","title":"macOS Dev Environment: Dotfiles"},{"content":"I threw away the AirPods I\u0026rsquo;d been using for over a year — a prize I won at a company event.\nIn 2022, I joined my current company as a backend engineer. There was a lot to learn in this new domain, and I focused on tackling each task one at a time.\nBuild a feature, resolve an issue, move on to the next sprint. 
The cycle itself wasn\u0026rsquo;t the problem.\nThe problem was that somewhere along the way, I had closed my ears.\nWhen sharing technical context or the reasoning behind decisions with colleagues, I thought I was communicating — but in reality, the message often didn\u0026rsquo;t land. In code reviews, technical discussions, incident responses — I was poor at translating what was in my head into a form others could understand, and poor at listening to others on their terms.\nWorking in my own bubble became a habit, and burnout followed.\nAfter much deliberation, I proposed a three-month sabbatical.\nI didn\u0026rsquo;t want to simply rest and call it done. I wanted to reflect on what I was lacking and spend the time improving.\nI used to hate documentation at work. I knew it mattered, but committing thoughts to writing always felt like a chore.\nBut communication isn\u0026rsquo;t a skill you build in your head. It builds slowly — by putting things out, trying to convey them, and sitting with how they land.\nSo I\u0026rsquo;m starting this blog. Not to write well — to practice the act of writing and sharing itself.\nTechnical things, things I\u0026rsquo;ve felt at work — I want to keep putting my thoughts out there, and build the muscle of sharing.\nI used to think \u0026ldquo;a good engineer = someone who knows technology well.\u0026rdquo; That\u0026rsquo;s not wrong, but it\u0026rsquo;s an incomplete definition.\nIf what I know never reaches my team, that knowledge might as well not exist.\nA good engineer isn\u0026rsquo;t someone who knows technology well, but someone who can share that knowledge with their team.\nThat\u0026rsquo;s why I threw away my AirPods. 
To listen — and to reach out.\n","permalink":"https://wid-blog.github.io/en/posts/career/dable/starting-sabbatical/","summary":"A good engineer isn\u0026rsquo;t someone who knows technology well, but someone who can share that knowledge with their team.","title":"I Threw Away My AirPods"},{"content":"A security compliance task was due.\nCertain columns in a running service were the target for encryption. They came in two shapes: column values that were themselves sensitive, and JSON-stored columns where only specific fields needed encryption. This was no greenfield system; traffic was already flowing.\nThe work split into two halves. One was building the encryption module. The other was applying that module to a running service. The second turned out to be larger.\nEncryption Strategy I picked symmetric AES-256-GCM. Key management followed an envelope encryption structure — a CMK encrypts a DEK, and the DEK encrypts the data. The two-tier structure limits the blast radius of a key leak and simplifies key rotation. The mechanics live in a separate tech post.\nFor key storage, a managed secret store won out. A managed KMS and a system configuration store were the alternatives, but for operating costs and the specific role of storing DEKs, the secret store fit best. An early design review surfaced that the system configuration store is meant for system configuration, not key storage.\nDEK Granularity — From Row to Table The initial design used per-row DEKs. Each row received its own DEK, stored alongside the row. A key leak would stay scoped to that single row.\nAn early design review pushed back: the operational and maintenance complexity was climbing too high. After re-examining the trade-off, I moved to per-table DEKs.\nPer-row keys grow with row count — every new row means another key issuance call and additional storage. 
In production, the impact runs deeper than raw cost: key API call frequency, backup/restore throughput, per-row key issuance logic during migration. The whole system gets heavier.\nPer-table keys widen the blast radius to a single table, but operations simplify. Separating keys per sensitivity tier narrows the blast radius by a different criterion.\nThe answer I first judged \u0026ldquo;safer\u0026rdquo; wavered once operations entered the picture. The weight of a decision lands only after both sides of the trade-off come into view.\nInternal Module — Two Patterns The two data shapes called for two different processing paths.\nThe first: replace the entire column value with a single ciphertext. Applied when the column itself is sensitive.\nThe second: replace only the relevant field values inside a JSON object with ciphertext. The object structure and non-sensitive fields remain plaintext.\nWithout bundling both patterns into one module, callers split into two paths. The module exposes both as first-class APIs.\nMigration — Three-Stage, Zero-Downtime A running service rules out a single-shot column swap. I split it into three stages.\nPrepare. Add encrypted columns via DDL. Apply dual writes in code — INSERTs and UPDATEs hit both plaintext and ciphertext columns, while SELECTs decrypt the encrypted column if present and fall back to plaintext otherwise. Migrate. Bulk-encrypt the existing plaintext rows into the new columns. Run dry-run first to confirm scope and timing, then execute with a tuned batch size. Clean up. Verify the new columns are fully populated, then drop the plaintext columns and remove the fallback branches. Each stage gates on the previous PR being merged and deployed. Stage N+1\u0026rsquo;s code assumes stage N is already live everywhere.\nThe WHERE Clause and HMAC A constraint surfaced mid-migration.\nSome columns were used in WHERE conditions — lookup queries, deduplication checks. Encrypting them outright breaks those queries. 
AES-GCM produces a different ciphertext for the same plaintext on every encryption, so WHERE email = '...' equality comparisons stop being meaningful.\nPreserving searchability required a deterministic transformation. I added an HMAC column alongside the encrypted one. On write, the original value gets hashed once and encrypted once — two stored representations. Lookups go through the HMAC column; full value recovery goes through the ciphertext column.\nThis constraint never showed up in the column survey. Column names and types say nothing about how a column actually gets used in queries. The codebase had to be read directly to surface it.\nSpreading Across the Org — A Migration Automation Skill A working module is not the end. The target columns spread across many services, and someone had to write the three-stage migration for each one.\nPeople repeating the same procedure leak mistakes. Issue tracker tickets came in inconsistent shapes, making column-info parsing fragile, and migration script dry-runs varied person to person.\nI built an automation Skill so any engineer could run the same procedure end to end. It parses column information from standardized metadata on the issue tracker ticket, generates a migration script matched to the module\u0026rsquo;s API, and walks through dry-run inspection before live execution.\nThe issue tracker ticket format got standardized alongside. Server, database, table, column name, type, sensitive fields — a defined table layout, with the Skill prompting for missing fields when the description falls short.\nThe earlier hackathon experience of reaching for AI tools — back then, just to ship fast — shifted here toward shaping a standard organizational procedure.\nTakeaways Applying the module weighed more than building it. Envelope encryption, the two patterns, the three-stage migration all started as standard patterns, but got reshaped in production. I moved DEK granularity from row to table for operational cost. 
I added an HMAC companion column to keep both AES-GCM confidentiality and search. I built a Skill to standardize the same procedure across multiple services. That process was where the actual work lived.\nA design does not settle in one pass. Row-level moved to table-level, both patterns turned out to be required, and HMAC entered mid-flight. The answers I first judged correct kept getting reshaped as operations pressed back.\nIn the end, making the module usable mattered as much as building it. The Skill filled that gap. Security compliance was the immediate goal, but what stayed behind was an organizational standard for handling sensitive data.\nReferences Envelope Encryption — the CMK/DEK two-tier key structure and its mechanics. Internal Hackathon — First Place Retrospective — the starting point for using AI tools in my work. ","permalink":"https://wid-blog.github.io/en/posts/career/dable/sensitive-data-encryption-retrospective/","summary":"A retrospective on column-level encryption of sensitive data in a running service. Envelope encryption, DEK granularity decisions, the WHERE clause constraint that led to HMAC, and the migration automation Skill that spread the work across the org.","title":"Sensitive Data Encryption — Module Design and Migration Retrospective"},{"content":"I joined the internal hackathon.\nIt was a chance to work alongside colleagues I rarely worked with before and reach a result in a short window. It was also the moment I started using AI tools.\nThe Idea When you use a corporate card, you have to file an expense voucher in the groupware. The company has an expense voucher guide, and every time you have to find that guide and fill in the form by it. A small lunch payment still demands the right category, memo, tax code — every field by the rule.\nI proposed automating that with an LLM Agent. The user types the payment memo, the Agent reads the company\u0026rsquo;s guide, and fills the voucher form. 
The scope fit a hackathon, and the value showed even at that scope.\nShifting to Chrome Extension We started with a Slack Bot in mind — already familiar inside the company. Then a teammate suggested Chrome Extension instead. With it, the user can run the chatbot on top of the groupware page and not break flow. That switch was the decisive change for usability.\nThe team brought together developers and non-developers I rarely worked with before. Chrome Extension UI, backend Agent response shape, how to inject the company guide into the LLM, the review flow for users — each person\u0026rsquo;s strengths met at one outcome in a short window. Decisions that would have been a long thread on a chat tool got resolved next to each other in a sentence or two.\nWhat We Built A Chrome Extension chatbot wired to a backend LLM Agent.\nA user opens the chatbot on a groupware page, types the payment memo, and the backend Agent reads the company\u0026rsquo;s expense voucher guide and returns a filled form. The user reviews the result and applies it in the groupware.\nThe tool combination came together this hackathon, too. For development, I started using Claude Code. For the Agent\u0026rsquo;s backend LLM, we used ChatGPT\u0026rsquo;s structured output. Both were first proper uses for me.\nWhere AI Tools Started for Me Claude Code let me get to a result fast. Volume that would have been hand-written before passed quickly, and I moved on to the next decision sooner.\nBut not splitting work into small enough units made mistakes accumulate. A large change handed off in one go shifted parts I intended and parts I did not, and chasing those differences afterward took longer than the time I saved. Some of the code that came out was hard to maintain.\nIn the follow-up work after the hackathon, refactoring AI-written code became a separate task on its own. That was the cost of moving a fast prototype into actual service. 
I see it as material for the next step rather than a regret.\nPresentation and Launch I gave the company-wide presentation. I do not present often, and the pressure of summarizing in a short window was a good push. The team won 1st place.\nThe work after the hackathon was longer than the hackathon itself. The project was launched internally, and shaping it into a real service required about two months of follow-up improvements. Chrome Extension stabilization (TypeScript port included), Agent response shape, batch processing, the review flow for users — the hackathon POC kept being shaped into a real service.\nThere is still room for improvement, but the hackathon output is in actual use inside the company — that is the largest outcome.\nWhat the Starting Point Means What started here matters more than the 1st-place result. Bringing colleagues who do not usually work together to one outcome in a short window is rare inside a company. And this hackathon was where I started using Claude Code in earnest. I saw the limits along with the value, but the material for the next step came together in that time.\n","permalink":"https://wid-blog.github.io/en/posts/career/dable/worthy-hackathon-retrospective/","summary":"A retrospective on the internal hackathon. How an idea I proposed evolved with the team into a 1st-place project and an internal launch — and the starting point for using AI tools in earnest.","title":"Internal Hackathon Retrospective — 1st Place"},{"content":"JIRA covers the work unit that pairs with the flow of code — issues and tickets. When the Sprint, the ticket lifecycle, and Git/GitHub integration move together as a single bundle, the daily cost of context switching drops noticeably.\nIssue and Workflow The core of JIRA reduces to two pieces — issues and the workflow that runs on them.\nAn issue is a unit of work. 
Each issue carries a type (Story, Task, Bug, Epic, Subtask), a state, an assignee, and a set of free-form metadata (labels, priority, story points). Using JIRA ultimately means creating, grouping, and transitioning these issues.\nA workflow is the set of state-transition rules an issue passes through. The simplest version looks like this:\nflowchart LR Open[Open] --\u003e InProgress[In Progress] InProgress --\u003e InReview[In Review] InReview --\u003e Done[Done] InReview --\u003e InProgress Open --\u003e Closed[Closed] Each arrow corresponds to one transition. Teams customize states and transition rules to match their process — the simpler it is, the easier it is for people to follow; the more complex it is, the more it has to be handled exclusively through automation.\nThe key principle in workflow design is restraint on the number of states. As fine-grained states like \u0026ldquo;Pending Review\u0026rdquo;, \u0026ldquo;Ready for QA\u0026rdquo;, \u0026ldquo;QA in Progress\u0026rdquo;, and \u0026ldquo;Ready for Release\u0026rdquo; pile up, more time is spent moving issues around — and eventually people start ignoring states altogether.\nSprint A Sprint is a time-boxed bundle of issues. One- to two-week durations are typical, and the goal is to finish what\u0026rsquo;s bundled inside that window.\nA Sprint\u0026rsquo;s lifecycle has three stages.\nSprint Planning: pick issues from the backlog into the next Sprint. Compare team capacity against issue estimates and avoid overcommitting. Sprint execution: move issues through In Progress → In Review → Done. Daily standups surface progress and blockers. Sprint Review/Retro: at the end, check what got done, what carried over, and how to do the next Sprint better. Sprint scope creep — issues being added mid-Sprint — is the most common failure mode. Once the planned scope is no longer kept, the Sprint reduces to a time slice with no planning meaning behind it. 
When urgent issues must enter mid-Sprint, requiring an equivalent amount of work to be removed in trade is a useful safeguard.\nIssue Hierarchy Most teams use a hierarchy along these lines.\nLevel Meaning Example Epic Large bundle of work, usually spans multiple Sprints \u0026ldquo;Migrate payment system to v2\u0026rdquo; Story A unit of user value, fits inside one Sprint \u0026ldquo;User can save card information\u0026rdquo; Task A unit of technical work \u0026ldquo;Implement payment API endpoint\u0026rdquo; Bug A defect report \u0026ldquo;Error message missing on payment failure\u0026rdquo; Subtask A smaller piece inside a Story or Task \u0026ldquo;Write unit tests for payment API\u0026rdquo; The hierarchy is convention, not enforcement, so teams reshape it to match their flow. Teams that find Story and Task hard to distinguish often merge them; some teams replace Subtasks with simple checklists.\nThe essential split is Epic vs Story. Work that doesn\u0026rsquo;t fit in a Sprint should be grouped as an Epic and broken down into Stories — that\u0026rsquo;s what makes planning at the Sprint level possible.\nGit/GitHub Integration When JIRA pairs with Git/GitHub, issues and code changes get connected automatically. The conventions are simple.\nIssue key in the branch name\nfeature/PROJ-123-add-search JIRA recognizes PROJ-123 as the issue key.\nIssue key in the commit message\nPROJ-123: add search endpoint JIRA links this commit to issue PROJ-123 automatically.\nIssue key in the PR title or body\nPROJ-123: Add search endpoint closes PROJ-123 When the PR opens, JIRA surfaces a PR link on the issue\u0026rsquo;s page. PR state changes (open/merged) flow back as well.\nSmart commits are syntax that lets commit messages drive issue transitions directly. Writing PROJ-123 #close or PROJ-123 #time 2h will close the issue or log work time when the commit is merged.\nThe biggest payoff is preventing context loss. 
Being able to see commits and PRs from the issue page, and to jump back to the issue from the PR, dramatically lowers the cost of asking \u0026ldquo;why is this code the way it is?\u0026rdquo;.\nJIRA Automation JIRA includes a rules engine that triggers state transitions automatically.\nA few common patterns:\nPR open → In Review: when a PR linked to the issue opens, move the issue to In Review automatically PR merged → Done: when the PR merges into main, move the issue to Done Carry over at Sprint end: unfinished issues automatically move to the next Sprint SLA alerts by Bug priority: a P1 bug that stays In Progress longer than 24 hours triggers a Slack alert Auto-assign by label: issues with a specific label get auto-assigned to the team\u0026rsquo;s on-call Automation should start small. Building a complex bundle of rules from day one makes debugging painful, and a misfiring rule can move issues around in unintended ways. Beginning with simple PR-state synchronization and expanding once trust has built up is the safer path.\nCommon Pitfalls Issues and commits drift apart Without enforcing the issue-key convention, commits don\u0026rsquo;t auto-link to issues. JIRA ends up used only as a PM tool while the code flow is tracked solely on GitHub — a clean split that costs context every time someone needs to connect them. Linting branch names or adding an issue-key field to the PR template are common safeguards.\nWorkflows that are too complex A workflow with five-plus states, multiple branches, and approval steps becomes too much for people to follow. The result is workarounds — people drag issues that \u0026ldquo;should be Done\u0026rdquo; straight to Done and skip the intermediate states. A simpler workflow backed by automation is almost always the better choice.\nSprint scope creep When new issues keep flowing in after planning, the Sprint loses its planning meaning and becomes just a time slice. 
Forcing an equivalent amount to be removed in trade when urgent issues come in is a structural safeguard. The PM or team lead owns that decision.\nIssues for the sake of issues When small tasks each get their own issue with full metadata, the issue system becomes a job in itself. Teams need a shared agreement on what\u0026rsquo;s worth tracking — \u0026ldquo;work that requires synchronization with someone else\u0026rdquo; is usually a good cutoff.\nSeries Wrap-up This series treated developer collaboration tooling through four elements.\nGit (Article 1) — a graph of changes; commit / branch / merge·rebase GitHub PRs (Article 2) — a collaboration layer on top of the graph; PR-level design + Code Review GitHub Actions (Article 3) — automated verification and deployment; workflow / job / step JIRA (Article 4) — the work units paired with the graph; issues + Sprint + Git/GitHub integration Each tool carries its own abstraction, but the four shine when they\u0026rsquo;re paired together. Issue keys flowing through branches, commits, and PRs; PRs gated by automatic verification; merges that auto-transition issue states. The automation built on those connections is what cuts the daily cost of context switching, and team velocity gets built on top of it.\n","permalink":"https://wid-blog.github.io/en/posts/tech/devenv/jira-sprint-workflow/","summary":"Looking at JIRA\u0026rsquo;s issues and workflows as a graph of work units — covering the Sprint lifecycle, issue hierarchy, Git/GitHub integration patterns, and automation flows.","title":"JIRA Sprint Workflows and Git/GitHub Integration"},{"content":"GitHub Actions is an event-driven automation engine that triggers build, test, and deploy on repository events and runs them automatically. 
The automation logic itself lives as code inside the repository.\nWorkflow / Job / Step Every Actions automation reduces to three units.\nflowchart TB Event[(\"Repository event(push / pull_request / schedule ...)\")] Event --\u003e WF[\"workflow(.github/workflows/*.yml)\"] WF --\u003e J1[\"job A(separate runner)\"] WF --\u003e J2[\"job B(separate runner)\"] J1 --\u003e S1[\"step 1\"] J1 --\u003e S2[\"step 2\"] J1 --\u003e S3[\"step 3\"] J2 --\u003e S4[\"step 1\"] J2 --\u003e S5[\"step 2\"] A workflow is a YAML file under .github/workflows/. It declares which events trigger it and which jobs run when those events fire.\nA job is the unit that runs inside a workflow. Each job runs on its own runner (a virtual machine). Jobs in the same workflow run in parallel by default, and the needs keyword declares ordering when sequential execution is required.\nA step is a single unit of execution inside a job. It\u0026rsquo;s either a shell command (run) or a call to a reusable action (uses). Steps in the same job run sequentially on the same runner and share the workspace.\nOnce these three are clear, even a complex pipeline can be assembled reliably. A frequent point of confusion is that steps in different jobs do not share a workspace. Tasks that need the same environment must live in the same job, or they have to pass data via artifact upload and download.\nTrigger Events Actions can fire on nearly any event happening in the repository. The common ones:\nEvent When it fires Typical use push A push to any branch Deploy on push to main pull_request PR open / sync / reopen (by default) PR CI (test / lint / build) schedule Cron expression Periodic tasks (dependency checks, cleanup) workflow_dispatch Manual run (UI/CLI) Deployments, one-off tasks release A GitHub Release is created Package publish, changelog repository_dispatch Webhook from outside External trigger integration PR CI is the most common pattern: a pull_request trigger combined with lint/test/build jobs. 
The status check from this workflow is what gets wired into branch protection as the merge gate.\nRunners The runner is the actual machine where steps run. Two kinds.\nGitHub-hosted runners are virtual machines provided by GitHub. You pick from ubuntu, windows, or macos, and each run starts in a clean environment. Setup overhead is essentially zero, security isolation is built in, and most projects start here.\nSelf-hosted runners are runner daemons installed on your own infrastructure. They\u0026rsquo;re useful for large datasets, specialized hardware (GPUs), or when access to internal networks is required. The trade-off is that you take on responsibility for security isolation, maintenance, and OS patching.\nMatrix builds run the same steps in multiple environments in parallel. A combination like [ubuntu, macos] x [node-18, node-20, node-22] produces six parallel jobs. It\u0026rsquo;s the standard pattern for verifying that a library or CLI works across environments.\nReusing Actions A step like uses: actions/checkout@v4 pulls in an action from the marketplace or another repository. The most-used actions are nearly fixed.\nactions/checkout — pull repository code actions/setup-node, actions/setup-python, actions/setup-go — install runtimes actions/cache — cache dependencies actions/upload-artifact, actions/download-artifact — pass files between jobs For reusing your own step bundles, two options exist. A composite action packages a bundle of steps via action.yml. A reusable workflow lets one workflow be called from another. Composite actions fit small step bundles; reusable workflows fit reuse at the pipeline scale.\nVersion pinning is directly tied to security and stability. Major tags like @v4 are convenient, but the SHA they point to can change, making them a target for supply-chain attacks. 
Security-sensitive environments pin to @\u0026lt;full-sha\u0026gt; instead.\nSecrets and Permissions When a workflow needs to authenticate with external services, secrets are the mechanism.\nsecrets.GITHUB_TOKEN is auto-issued at the start of every job in a workflow run. It carries baseline permissions for the repository and is used for pushes, PR comments, and issue comments.\nUser-defined secrets can be registered at the repository, environment, or organization level. Environment-scoped secrets pair with environment protection rules (required approvals, wait timers), making them a natural fit for production deployment jobs.\nThe permissions block narrows the token\u0026rsquo;s scope. Depending on repository and organization settings, the default can be broad read/write permissions, but specifying minimal permissions like contents: read at the workflow or job level limits the blast radius if a token is leaked. CI workflows often need only read access, so declaring it explicitly is a useful safety margin.\nCommon Patterns Day-to-day usage compresses into roughly four flows.\nPR CI: pull_request trigger with lint / test / build jobs. Wired to branch protection as a required check.\nDeploy on push to main: a push trigger filtered to main, running build → deploy. Staging deploys are usually automatic; production deploys typically combine environment protection rules with manual approval.\nRelease on tag push: filter push events with tags: ['v*'] to catch only tag pushes, then build and publish release artifacts.\nMatrix builds: verify that a library or CLI works across multiple environments.\nCommon Pitfalls Leaking secrets A step like echo $SECRET sends the value to the log. Actions masks the exact values of registered secrets, but transformed values (base64-encoded, for example) or partial substrings can slip past the mask. Never printing secrets is the simplest safety net.\nUnpinned action versions Pointing to a branch like uses: some-org/action@main means the contents can change at any time. 
There have been real incidents where a popular action was taken over and malicious code was pushed to main. Pinning to a major tag or, better, a full SHA is the standard.\nMissing cache Reinstalling dependencies on every run can add minutes to PR CI. actions/cache keeps directories like npm/pip/go modules so subsequent runs finish almost instantly. The cache key should include a hash of the lock file so changes invalidate the cache automatically.\nHeavy workflows queuing up If a self-hosted runner pool has only one machine, concurrent runs queue up and waiting times stretch. Setting up concurrency groups to auto-cancel previous runs from the same PR/branch, or scaling out runners, are the usual answers.\nWrap-up Actions automation rests on a three-layer abstraction.\nWorkflow / job / step: one workflow has many jobs; one job has many steps Triggers: repository events kick off workflows Runners: GitHub-hosted by default; self-hosted for specialized environments Secrets + permissions: minimal scopes and pinned action versions are the security floor The three-level abstraction (workflow / job / step) combined with event triggers and runners forms the GitHub Actions automation engine. Secret handling and least-privilege permissions form the boundary that keeps that engine safe to run.\nThe next article covers the work units that pair with code changes — JIRA Sprint workflows and the integration with Git and GitHub.\n","permalink":"https://wid-blog.github.io/en/posts/tech/devenv/github-actions-fundamentals/","summary":"GitHub Actions seen as an event-driven automation engine — the three-layer abstraction of workflow / job / step, plus the operational details of triggers, runners, and secrets.","title":"GitHub Actions Fundamentals — Workflow, Job, Step"},{"content":"A GitHub PR adds a collaboration layer on top of Git\u0026rsquo;s change graph. 
It is not just a screen with a merge button — it\u0026rsquo;s where change visibility, review decisions, and CI gating converge into a single unit.\nPull Request A PR is a bundle of diffs between a source branch and a target branch. GitHub wraps that bundle with the metadata collaboration needs — review system, conversation threads, status checks, branch protection — and the result is what we call a PR.\nflowchart LR Open[\"PR opened(diff bundle)\"] --\u003e Review[\"Review(comment / approve / request changes)\"] Review --\u003e Checks[\"CI status checks(test / build / lint)\"] Checks --\u003e Merge[\"merge\"] Merge --\u003e Close[\"close\"] These five stages usually flow naturally, but a poorly scoped PR will jam one of them. The most common jam is at review time.\nDesigning PR-Sized Units Small PRs get reviewed faster. The common rule of thumb is 200-400 lines of code, but the principle of one intent per PR matters more than the size. Mixing a refactor with a new feature forces the reviewer to evaluate them separately, and as a result neither gets a thorough look.\nEven big changes can stay single-intent when broken into stages. Adding a new interface, migrating call sites, removing the old interface — that kind of staged split keeps each PR small and each review light.\nA draft PR is a tool for getting early feedback on unfinished work. You open the PR before it\u0026rsquo;s mergeable, gather opinions on the direction, and then continue the actual work. It\u0026rsquo;s especially useful as a checkpoint before committing to a large change.\nThe Code Review Cycle Review isn\u0026rsquo;t just a venue for judging code right or wrong. Intent verification, knowledge sharing, and merge-readiness checks all happen there together.\nAuthors look at their own PR before requesting review. 
Reading your own change with the eyes of a first-time reader catches the small mistakes that often slip in just before merge — leftover debug code, unrelated edits, missing tests.\nReviewers pick one of four actions.\ncomment: information, questions, suggestions suggestion: a small concrete change the author can apply directly request changes: something that must be addressed before merge approve: agreement that the PR can be merged The core decision here is the blocker vs. nit distinction. Things that affect intent, correctness, or safety should be flagged as blockers; style or minor preference differences should be marked clearly as nits. When every comment carries the same weight, review gets heavier and real risks get buried among the noise.\nAfter a reviewer leaves comments, the author either responds or makes the change and resolves the thread. Once enough threads are resolved, the PR reaches a mergeable state.\nMerge Strategies GitHub offers three options for merging a PR into main.\nStrategy Graph Result Trait Merge commit Branch-and-converge preserved PR boundary remains visible in graph Squash merge Linear; one PR becomes one commit Simple history, internal PR commits gone Rebase merge Linear; PR commits preserved as-is Linear history, PR boundary blurred flowchart TB subgraph Source [\"Inside the PR\"] s1((c1)) --\u003e s2((c2)) --\u003e s3((c3)) end subgraph MergeCommit [\"merge commit\"] m1((m1)) --\u003e m2((m2)) --\u003e mc((merge)) m1 --\u003e mc1((c1)) --\u003e mc2((c2)) --\u003e mc3((c3)) --\u003e mc end subgraph Squash [\"squash merge\"] sq1((m1)) --\u003e sq2((m2)) --\u003e sq3((squashed)) end subgraph Rebase [\"rebase merge\"] r1((m1)) --\u003e r2((m2)) --\u003e rc1((c1')) --\u003e rc2((c2')) --\u003e rc3((c3')) end The team convention is essentially a choice about main\u0026rsquo;s graph shape. If you want each PR to appear as one clean unit, squash merge is the simplest. 
If the commits inside a PR are meaningful steps (refactor → feature → cleanup), rebase merge preserves that flow. If the merge flow itself is valuable as a record of collaboration, merge commits are the natural choice.\nConsistency matters more than the choice itself. A main branch that mixes squashes with merge commits ends up partly linear and partly branched in an awkward shape.\nCI Gating and Branch Protection PRs that merge on human review alone tend to leak regressions. Status checks are where automated verification belongs.\nGitHub watches the result of every status check (test, build, lint) tied to the PR. A branch protection rule on main can require specific checks to pass before merge, which prevents a broken PR from reaching main even if a reviewer accidentally clicks merge.\nPairing this with required-reviewer counts, codeowner auto-assignment, and required-up-to-date-with-main settings closes most of the gaps that show up at the PR stage.\nCommon Pitfalls Oversized PRs A PR with thousands of lines gets reviewed in name only. The reviewer leaves an \u0026ldquo;overall LGTM\u0026rdquo; and real risks slide through. Big work needs to be split by intent for review to retain substance.\nBikeshedding This is the pattern of spending review cycles on trivial preferences (indentation, name flavor) unrelated to the substance of the code. The cleanest fix is to remove these from review entirely by enforcing them with linters and formatters.\nAuto-LGTM When approval becomes a reflex, review becomes ceremony. A checklist (test coverage, intent change, security impact) or required multi-reviewer rules for large PRs are common safeguards that keep review honest.\nAccumulating Merge Conflicts The longer a PR stays open, the more it diverges from main, and the bigger the eventual conflict. 
Small PRs and fast reviews are the simplest answer; if that isn\u0026rsquo;t workable, frequent rebases against main keep the gap small.\nWrap-up GitHub PRs add a unit of collaboration on top of the Git graph.\nPR unit: one intent, broken into small pieces Code Review: separate blockers from nits; authors review their own PRs first Merge strategy: stay consistent within team convention CI gating: tie automated checks to merge requirements via branch protection In the end, a PR\u0026rsquo;s value comes from the combination of four pieces: unit design (one intent), Code Review (blocker / nit distinction), merge strategy (consistency), and the CI gate. If any one of them loosens, the PR gets stuck at the final step.\nThe next article digs into that automated verification itself — how GitHub Actions builds, tests, and deploys PRs.\n","permalink":"https://wid-blog.github.io/en/posts/tech/devenv/github-pr-and-code-review/","summary":"Looking at GitHub PRs as a collaboration layer on top of Git\u0026rsquo;s change graph, and walking through the Code Review cycle, PR-level design, and merge strategies.","title":"GitHub PRs and the Code Review Cycle"},{"content":"Git is something we use every day, yet the reasoning behind workflow decisions often stays vague. \u0026ldquo;Rebase keeps history cleaner than merge\u0026rdquo; is a phrase everyone has heard, but what that means precisely — and when not to use rebase — tends to remain fuzzy without a deliberate write-up.\nGitHub\u0026rsquo;s PR and Code Review are split off into a later article.\nCommit Everything in Git centers on the commit. A commit is one node in a graph, and it points to a parent commit to form history. A branch is just a label drawn on top of those nodes, and a merge is the point where two flows meet.\nflowchart LR A((A)) --\u003e B((B)) --\u003e C((C)) C --\u003e D((D)) C --\u003e E((E)) D --\u003e F((F)) E --\u003e F The shape of this graph determines the cost of future debugging. 
When git bisect chases a regression, when git blame traces intent, clean commit boundaries cut down the search space significantly.\nCommit Hygiene A good commit carries a single intent. Mixing a refactor with an unrelated feature in one commit means later, when only one of the two needs to be reverted, the work has to be split apart by hand.\nThe message captures that intent. The subject line stays under about 50 characters, and the body — separated by a blank line — explains the why when needed. The what is already visible in the diff, so what makes the message valuable is the reasoning behind the change.\nConventions like Conventional Commits classify messages with prefixes such as feat:, fix:, and chore:. They pair well with toolchains that auto-generate changelogs or decide semantic versions.\nBranch A branch is a label that points to a commit. When a new commit is made, the current branch\u0026rsquo;s pointer moves one step to the new commit. git checkout moves HEAD to a different branch and updates the index and working tree to match. There\u0026rsquo;s no separate directory created on disk and no heavy work involved.\nThat simplicity is what makes branching strategies practical.\nBranching Strategies Three patterns are commonly compared.\nStrategy Flow When it fits Trunk-based Short feature branches (hours to days), frequent merges into main Mature CI/CD, fast deploys GitHub Flow Feature branch → PR → merge to main → deploy SaaS, continuous deployment GitFlow main / develop / feature / release / hotfix layered separation Long release cycles, strict version management Trunk-based stays simple and integrates often, so merge conflicts shrink. GitFlow gives strict release control at the cost of more branches and operational overhead. GitHub Flow lands in between and fits most SaaS workflows well.\nThe core question is: how often can we integrate into main? 
The more frequent the integration, the simpler the graph and the smaller each conflict.\nMerge vs Rebase When combining two flows, Git offers two options.\nflowchart TB subgraph Before [\"Diverged\"] m1((m1)) --\u003e m2((m2)) --\u003e m3((m3)) m1 --\u003e f1((f1)) --\u003e f2((f2)) end subgraph Merge [\"After merge\"] mm1((m1)) --\u003e mm2((m2)) --\u003e mm3((m3)) --\u003e mc((merge)) mm1 --\u003e mf1((f1)) --\u003e mf2((f2)) --\u003e mc end subgraph Rebase [\"After rebase\"] rm1((m1)) --\u003e rm2((m2)) --\u003e rm3((m3)) --\u003e rf1((f1')) --\u003e rf2((f2')) end Merge combines two histories into a new commit (a merge commit). The graph keeps the divergence and convergence visible — when, by whom, and where things were combined.\nRebase replays one side\u0026rsquo;s commits on top of the other side\u0026rsquo;s tip. The graph becomes linear, but every replayed commit gets a new hash (f1 becomes f1'). They are, fundamentally, different commits.\nThe rule is clear: never rebase shared commits. If a teammate already has those commits in their branch, rebasing followed by a force push wrecks their history. On a local branch or a feature branch you have not shared yet, rebasing to clean up history is safe.\nThe policy for main itself is a team convention. Teams that prefer linear history use rebase or squash; teams that want the merge flow preserved use merge commits. There is no universally right answer — what matters more is consistency once a choice is made.\nCommon Pitfalls git push --force overwriting a teammate\u0026rsquo;s commits Force-pushing a shared branch can erase commits a teammate added after the rewritten point. --force-with-lease rejects the push when the remote has moved beyond what you saw locally — a built-in safety net.\nConflicts on every commit during rebase Rebase replays commits one at a time, so conflicts surface per commit. 
If resolving them once is preferable, falling back to a merge, or squashing the branch with git rebase -i before rebasing it onto main, are the practical options.\nLost work after git reset --hard reset --hard resets both the working tree and the index. Committed work survives in the reflog — git reflog finds the SHA, and git reset --hard \u0026lt;sha\u0026gt; restores it. Uncommitted changes are harder to recover, so a habit of making a temporary commit before risky operations is a useful safety net.\nMerge commits forming a tangled web When small feature branches merge often into main, the graph stays clean. When long-lived branches keep merging into each other, the graph turns into a tangle. Squash merges or rebase merges can clean it up, but the deeper fix is to reduce the number of long-lived branches in the first place.\nWrap-up Git workflow decisions reduce to choices about graph shape.\nCommit: one intent per commit, the why in the message Branch: just a pointer; integration frequency drives the strategy Merge vs Rebase: don\u0026rsquo;t rebase shared commits; beyond that, follow team convention In the end, commit hygiene, branch strategy, and merge-vs-rebase choices add up to the shape of the graph. The simpler that shape stays, the lower the cost of later debugging and collaboration.\nThe next article picks up from here: how this graph is collaborated on through GitHub PRs — the Code Review cycle and merge strategies.\n","permalink":"https://wid-blog.github.io/en/posts/tech/devenv/git-workflow-basics/","summary":"Looking at Git as a graph of changes — and seeing how commit hygiene, branching strategy, and the merge-vs-rebase choice are all decisions about the shape of that graph.","title":"Git Workflow Basics — Commits, Branches, Merge vs Rebase"},{"content":"Traffic inside a VPC passes through two layers of defense before reaching an instance — NACL at the Subnet boundary and Security Group at the instance boundary. 
They look similar on the surface, like \u0026ldquo;firewall rules,\u0026rdquo; but they differ in scope, evaluation, and statefulness. Without that distinction, traps such as \u0026ldquo;outbound is allowed but the response is blocked\u0026rdquo; show up surprisingly often.\nSecurity Groups A Security Group (SG) is a set of rules attached to an instance or an ENI (Elastic Network Interface). Multiple SGs can attach to one instance; in that case, rules apply as a union.\nThe most distinctive trait is that SGs are stateful. Once a connection is allowed, return traffic is permitted automatically — there\u0026rsquo;s no need to mirror an inbound rule with a matching outbound rule for replies.\nSGs are allow-only. There is no deny rule; only what you explicitly allow gets through. The defaults are all-outbound-allowed, all-inbound-denied. A new SG starts with everything blocked from outside, and you add ports and sources one by one as needed.\nThe source/destination of an SG rule can be a CIDR or another SG\u0026rsquo;s ID. Inside the same VPC, this lets you express rules like \u0026ldquo;only resources tagged with this SG can connect,\u0026rdquo; so rules don\u0026rsquo;t break when an instance\u0026rsquo;s IP changes.\nNACLs A NACL (Network ACL) attaches at the Subnet level. One NACL per Subnet, and every resource in that Subnet is governed by the same NACL.\nNACLs are stateless. Inbound and outbound rules are completely independent — return traffic for an allowed connection still requires its own explicit rule on the other side. What an SG handles automatically, a NACL requires you to write out.\nNACLs support both allow and deny rules. Rules are numbered, evaluated in ascending order, and the first match wins. When you need to block specific traffic explicitly, NACLs are the place to do it.\nThe defaults split into two cases. 
The default NACL created with a VPC allows everything, while a custom NACL you create blocks everything to start.\nEvaluation Order Two layers sit between an outside packet and a workload, and they evaluate in the following order:\nflowchart LR Ext[\"External\"] --\u003e|\"inbound\"| NACL_in[\"NACL(Subnet inbound)\"] NACL_in --\u003e SG_in[\"SG(Instance inbound)\"] SG_in --\u003e VM[\"VM\"] VM --\u003e SG_out[\"SG(Instance outbound)\"] SG_out --\u003e NACL_out[\"NACL(Subnet outbound)\"] NACL_out --\u003e|\"outbound\"| Ext2[\"External\"] Inbound traffic from outside passes the NACL at the Subnet boundary first, then is evaluated again at the instance boundary by the SG. Responses go the other way — through the SG outbound, then the NACL outbound. Both layers must permit the traffic for it to make the round trip.\nSGs are stateful, so the return path is automatic from the instance side. NACLs are stateless, so the response leg also has to be allowed explicitly. That gap is where the common troubleshooting pitfalls live.\nCommon Pitfalls Outbound allowed, response blocked The NACL allows outbound, but the inbound rule does not allow the destination port of the response. Replies typically arrive on ephemeral ports (1024-65535), and if that range is not covered in NACL inbound, responses get dropped.\nThis issue is invisible in environments using only SGs because they\u0026rsquo;re stateful. It surfaces the moment you introduce a NACL into the picture. The fix is to allow ephemeral port ranges in both directions of the NACL.\nSG rules look like they aren\u0026rsquo;t taking effect If a rule was added but traffic still seems blocked, the more likely cause is the NACL — not another SG conflicting with this one. SGs are per-instance, so allow rules on one SG aren\u0026rsquo;t undone by another SG. NACLs apply Subnet-wide as a single ruleset.\nDefault vs custom NACL defaults The default NACL starts with everything allowed and you remove specific traffic from there. 
A custom NACL starts with everything denied and you allow specific traffic. Swapping a default NACL for a custom one without re-adding allow rules cuts off all traffic to the Subnet — a common foot-gun.\nComparison The differences in one table:\nAspect Security Group NACL Scope Instance / ENI Subnet Statefulness Stateful (return traffic auto-allowed) Stateless (return needs its own rule) Rule types Allow only Allow + Deny Evaluation Union of all rules Numbered order, first match wins Defaults Inbound denied, outbound allowed Default NACL: all allowed / custom: all denied Source expression CIDR or another SG ID CIDR only Vendor Naming Map Concept AWS GCP Azure Alibaba Cloud Per-instance, stateful Security Group Firewall Rule (target tag) NSG (NIC-attached) Security Group Per-subnet, stateless Network ACL (no direct equivalent) NSG (Subnet-attached) Network ACL GCP doesn\u0026rsquo;t separate SG and NACL — it has a unified Firewall Rules model with targets such as instance tags, network tags, or service accounts. The mental model is structurally one step different from AWS, Azure, and Alibaba.\nAzure\u0026rsquo;s NSG is the same resource regardless of whether it attaches at the NIC or Subnet level, which is closer to merging AWS\u0026rsquo;s SG and NACL into one resource.\nSeries Wrap-up This closes out the VPC fundamentals series. Four elements combine into the single abstraction called a VPC.\nIsolation (Article 1) — IP space, Subnets, and Tenancy simulating a private network boundary Routing (Article 2) — Route Tables deciding traffic paths, IGW and NAT as two kinds of exits Connectivity (Article 3) — VPC Peering, Transit Gateway, VPN, and PrivateLink for external connections Security (Article 4) — Security Groups and NACLs forming two layers of defense Vendor names differ, but the abstractions are nearly the same. 
Build a mental model in one vendor, and moving to another is far less disorienting.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/vpc-security-fundamentals/","summary":"How Security Groups (stateful, per-instance) and NACLs (stateless, per-subnet) form different layers of defense in a VPC, plus the common pitfalls each surface.","title":"Security Groups and NACLs"},{"content":"Once routing inside a VPC is settled, the next question is how a VPC connects beyond itself — to other VPCs, on-premises data centers, and external SaaS services.\nVPC Peering, Transit Gateway, Site-to-Site VPN, and PrivateLink each have their own topology, cost model, and operational trade-offs, and the choice between them shapes the whole system.\nVPC Peering VPC Peering is the simplest option. It connects two VPCs directly so that each can reach the other\u0026rsquo;s private IP space.\nPeering works within a single region and account, and also across regions and accounts. Its topology is a 1:1 mesh, so connecting three VPCs to each other requires three peerings (A-B, B-C, and A-C).\nA key limitation is that transitive routing is not supported. Even with A-B and B-C in place, A cannot reach C through B. Both ends must have a direct peering. As the number of VPCs N grows, the number of peerings required scales as N(N-1)/2, which makes Peering a fit only for a handful of VPCs.\nTransit Gateway Once the number of VPCs starts to grow, Peering\u0026rsquo;s mesh structure quickly becomes a burden. Transit Gateway addresses this limit.\nflowchart LR subgraph Peering [\"Peering: mesh\"] VA[\"VPC A\"] --- VB[\"VPC B\"] VB --- VC[\"VPC C\"] VA --- VC VC --- VD[\"VPC D\"] VA --- VD VB --- VD end subgraph Transit [\"Transit Gateway: hub-spoke\"] TGW((\"TGW\")) TVA[\"VPC A\"] --- TGW TVB[\"VPC B\"] --- TGW TVC[\"VPC C\"] --- TGW TVD[\"VPC D\"] --- TGW end Transit Gateway acts as a central hub. 
Multiple VPCs attach as spokes, and traffic between any two spokes flows transitively through the hub. The number of attachments grows linearly with the number of VPCs.\nThe cost model differs from Peering. Transit Gateway charges per attachment hour plus per byte processed, which makes it more expensive than Peering at small scale. As N grows, the cost balance flips. The option to split routing into multiple route domains also makes it useful for multi-tenant setups that need isolation.\nSite-to-Site VPN Site-to-Site VPN is usually the first option that comes up when a VPC needs to reach an on-premises data center. It builds an IPSec tunnel over the public internet, logically connecting the two networks.\nBoth static routing and BGP dynamic routing are supported. Because the underlying transport is the public internet, bandwidth and latency are variable, which can be a constraint for mission-critical traffic. When more stable connectivity is required, dedicated-line options like Direct Connect or Cloud Interconnect exist as separate alternatives.\nPrivateLink While the previous three rely on IP routing, PrivateLink is structurally different. Instead of stitching two networks at the IP/CIDR level, it exposes services at the service level.\nThe service provider creates an endpoint inside its VPC, and the consumer side sees that endpoint as an ENI (Elastic Network Interface) or equivalent inside its own VPC. The IP layout of either VPC is irrelevant — endpoint-level connection sidesteps any CIDR collision.\nDirection matters too. PrivateLink is one-way: the consumer calls into the service the provider exposed, and the reverse direction would need a separate endpoint. 
It is commonly used for SaaS services, internal service exposure, and managed-service entry points into a VPC.\nComparison Lining up the four mechanisms makes the trade-offs explicit.\nMechanism Topology Transitive Cost Model Primary Use VPC Peering 1:1 mesh ❌ Lower (per byte) Few VPCs, direct Transit Gateway Hub-spoke ✅ Per hour + per byte Many VPCs, route-domain split Site-to-Site VPN Site ↔ VPC tunnel (BGP: ✅) Per hour + per byte On-prem ↔ VPC PrivateLink Service endpoint (n/a) Per endpoint + per byte Service exposure, CIDR-agnostic CIDR Collision Pitfalls VPC Peering and Transit Gateway are both IP-routing-based. If two VPCs use overlapping CIDRs, packet destinations become ambiguous and routing breaks down.\nCarving up CIDR ranges without overlap from day one avoids the operational cost that surfaces at the connectivity stage. Allocating IP ranges at the organization level pays off especially in environments with many VPCs.\nFor environments where collisions already exist, PrivateLink is the workaround. Because it does not rely on IP routing, two VPCs with the same CIDR can still talk to each other through service endpoints. The trade-off is that PrivateLink only fits the narrow pattern of \u0026ldquo;exposing a specific service,\u0026rdquo; not broad IP-level communication.\nVendor Naming Map The four mechanisms across vendors:\nConcept AWS GCP Azure Alibaba Cloud 1:1 direct VPC Peering VPC Network Peering VNet Peering VPC Peering Hub-spoke at scale Transit Gateway Network Connectivity Center Virtual WAN CEN (Cloud Enterprise Network) On-prem IPSec Site-to-Site VPN Cloud VPN VPN Gateway VPN Gateway Service-level exposure PrivateLink Private Service Connect Private Link PrivateLink The names shift slightly, but the abstractions are nearly the same. 
AWS\u0026rsquo;s PrivateLink, GCP\u0026rsquo;s Private Service Connect, and Azure\u0026rsquo;s Private Link all share the same service-endpoint model.\nWrap-up External connectivity for a VPC reduces to one of four mechanisms.\nVPC Peering: 1:1 mesh, no transitive routing, fits a small number of VPCs Transit Gateway: hub-spoke, transitive, cost balance flips in favor at scale Site-to-Site VPN: IPSec tunnel between an on-premises site and a VPC PrivateLink: service-endpoint exposure that is independent of IP routing Which mechanism is chosen effectively decides the topology, cost, and isolation policy of how a VPC connects beyond itself. And because three of the four rely on IP routing, a CIDR collision can render that choice unusable — which is why allocating IP ranges at the organization level from the start is what holds the choice together.\nThe next article covers security — how Security Groups and NACLs build different layers of defense on top of a VPC.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/vpc-connectivity-fundamentals/","summary":"Comparing the four mechanisms that connect a VPC to other VPCs, on-premises networks, and external services — Peering, Transit Gateway, Site-to-Site VPN, and PrivateLink — across topology and cost.","title":"Connecting VPCs to Other Networks — Peering, VPN, Transit, PrivateLink"},{"content":"Integrating external SSPs increases traffic and revenue. But rising revenue does not mean rising profit.\nWhen an ad request arrives, the recommendation server generates ad candidates and the filtering server removes unsuitable ads. Media fees are paid to the external SSP for each impression. Add the server costs for these processing steps to the media fees, and some inventories cost more than they earn. 
As server cost share grew, contribution margin was shrinking.\nThe decision was to build a system that automatically identifies low-performing inventory and throttles its traffic.\nPerformance Metric Analysis Determining which inventories underperform required clear criteria. Three candidates were analyzed.\nImp Cost Ratio. The ratio of media cost to revenue. Above 100% means media cost exceeds revenue — a net loss from the contribution margin perspective, even if revenue is generated. This was the most intuitive metric.\nRPM. Revenue per 1,000 impressions. Some inventories had low RPM but were still profitable. Lower priority than Imp Cost Ratio. In practice, Imp Cost Ratio alone produced enough throttling candidates, so RPM was excluded.\nWin Ratio. The proportion of SSP bids won. A low Win Ratio means ads are prepared but never served — server resources consumed with no revenue generated. Useful as a supplementary metric for server cost reduction.\nImp Cost Ratio was set as the primary criterion, with Win Ratio as a secondary supplement.\nFirst Approach How to throttle traffic for inventories with Imp Cost Ratio above 100% was examined.\nTwo methods were compared. The first applied weighted throttling based on Imp Cost Ratio and impression share — higher ratio and higher impression volume meant more aggressive throttling. The second applied a fixed throttling rate to all target inventories. Simple but reliable.\nBoth methods were simulated in Redash. The simulation showed traffic decreasing for inventories where media cost exceeded revenue.\nBut the approach was not applied. A limitation surfaced through discussion. The project goal was contribution margin improvement, while Imp Cost Ratio only reflects media cost. It ignores server costs. Some inventories were profitable by media cost alone but unprofitable when server costs were included. 
The full picture of contribution margin was missing.\nSecond Approach A comprehensive profitability metric that included server costs was needed. A predicted contribution margin rate was introduced.\nMeasuring server cost per inventory directly is difficult. The approach was to allocate by impression-based contribution.\nFrom this, revenue and cost items are combined to calculate a predicted contribution margin rate. A negative value means the inventory is losing money from a contribution margin perspective.\nThrottle Rate Calculation The contribution margin rate needed to be converted into a traffic throttle rate. Several functions were compared.\nFunctions that react sharply at the early stage were deemed too aggressive. The goal was to improve contribution margin while minimizing revenue impact, so a graduated correction shape was chosen. The threshold was made configurable externally, allowing flexible control over the scope of throttled inventories.\nBatch Architecture Server cost data is aggregated daily, so the batch runs on a daily cycle.\nPer-inventory ad performance (impressions, media cost, revenue) is queried. Per-service server costs are queried. The two datasets are combined to calculate per-inventory predicted contribution margin rate, derive the throttle rate, and store it in the database. The ad server references each inventory\u0026rsquo;s throttle rate when handling incoming requests.\nRevenue alone hides important details. Recognizing that revenue was rising while profit was falling was the starting point.\nThe first approach looked only at media cost. The simulation looked promising, but server costs were missing. It was not applied; the second approach took over, incorporating server costs and revealing the full contribution margin picture. 
Catching the gap between metric and goal before any rollout — and advancing the approach by a step — was the most valuable learning from this project.\nContribution margin improved meaningfully.\n","permalink":"https://wid-blog.github.io/en/posts/career/dable/profitability-based-traffic-throttling/","summary":"Retrospective on building a system that automatically identifies low-performing SSP inventory and throttles traffic to improve contribution margin. Covers the evolution from Imp Cost Ratio to a predicted contribution margin rate approach.","title":"Profitability-Based Traffic Throttling Retrospective"},{"content":"Where traffic inside a VPC goes is decided by Route Tables. Exits to the public internet are split between two gateways: IGW handles bidirectional traffic, NAT handles outbound only.\nRoute Tables A Route Table is a set of routing rules attached to a VPC or a Subnet. Each rule pairs a destination CIDR with a next hop (target), telling traffic where to go.\nMatching is by longest prefix. A more specific CIDR rule wins. If both 0.0.0.0/0 (the default route) and 10.0.5.0/24 exist and a packet\u0026rsquo;s destination is 10.0.5.42, the 10.0.5.0/24 rule is chosen.\nA Local route is added automatically when a VPC is created. It points to the VPC\u0026rsquo;s full CIDR, so resources inside the same VPC can communicate without any extra configuration. The Local route cannot be deleted.\nflowchart LR Pkt[\"Packetdestination: 8.8.8.8\"] --\u003e RT[\"Route Table\"] RT --\u003e|\"10.0.0.0/16(local)\"| Local[\"Inside VPC\"] RT --\u003e|\"0.0.0.0/0(default)\"| Out[\"IGW or NAT\"] Internet Gateway (IGW) The Internet Gateway is the component responsible for bidirectional traffic between a VPC and the public internet. A VPC can attach only one IGW, and external traffic flows only when one is attached.\nReaching a resource from outside requires two conditions together. 
The resource must have a Public IP or Elastic IP attached, and the Subnet it sits in must have a Route Table with a default route pointing to the IGW. Satisfying only one is not enough.\nNAT Gateway A NAT Gateway is an outbound-only exit. It is used when resources in a Private Subnet need to reach the public internet but must not be reachable from the outside directly.\nThe NAT Gateway itself sits in a Public Subnet, because NAT eventually has to push traffic out through the IGW. The Private Subnet\u0026rsquo;s Route Table points its default route at the NAT Gateway, and the NAT translates traffic onto its own Public IP before forwarding it.\nConnections initiated from outside cannot pass through the NAT, so Private resources interact with the outside only outbound. The NAT\u0026rsquo;s asymmetry becomes the security benefit itself. Note that NAT Gateways are billed per hour and per byte, which makes them a cost concern for outbound-heavy workloads.\nPublic Subnet vs Private Subnet Public Subnet and Private Subnet are not properties of the Subnet itself. They are the result of routing rules.\nPublic Subnet: a Subnet whose Route Table has a default route pointing to the IGW Private Subnet: a Subnet whose default route points to a NAT, or has no default route at all flowchart LR subgraph Public [\"Public Subnet\"] VM_P[\"VM (Public IP)\"] -.-\u003e RT_P[\"Route Table0.0.0.0/0 → IGW\"] end subgraph Private [\"Private Subnet\"] VM_R[\"VM\"] -.-\u003e RT_R[\"Route Table0.0.0.0/0 → NAT\"] end RT_P --\u003e IGW[\"IGW\"] RT_R --\u003e NAT[\"NAT Gateway\"] NAT --\u003e IGW Two Subnets in the same VPC are simply attached to different Route Tables — there is no Public or Private flag on the Subnet itself.\nCommon Routing Pitfalls A Public IP exists, but the resource is not reachable A Public IP is not a sufficient condition for reachability. The Subnet hosting the resource must have a Route Table with a default route pointing to the IGW. 
Attaching a Public IP to an instance in a Private Subnet does not make it reachable from outside.\nSame-VPC traffic works even with an empty Route Table The Local route is added automatically and cannot be removed. Resources within the same VPC can always talk to each other, regardless of other rules.\nWhy NAT is deployed in every AZ A NAT Gateway is a per-AZ resource. If one AZ hosts a NAT and Private Subnets in another AZ point at it, an outage in the NAT\u0026rsquo;s AZ takes outbound traffic from the other AZ down with it. Systems that care about availability deploy a NAT per AZ. The cost rises in proportion.\nVendor Naming Map Routing components by vendor:\nConcept AWS GCP Azure Alibaba Cloud Routing rules Route Table Routes Route Table Route Table External exit (bidirectional) Internet Gateway (default internet gateway, implicit) Public IP + NSG Internet Gateway Outbound-only exit NAT Gateway Cloud NAT NAT Gateway NAT Gateway GCP differs in that the default internet gateway is not exposed as an explicit resource — it shows up only as a next hop in Routes. Azure has no separate Internet Gateway resource; external exposure is governed by Public IPs and NSG rules.\nWrap-up Three elements decide the path traffic takes inside a VPC.\nRoute Table: a set of rules attached to a VPC or Subnet. Longest prefix match and the Local route are provided by default. IGW: a bidirectional exit to the public internet. Reachability requires both a Public IP and a default route in the Subnet\u0026rsquo;s Route Table. NAT Gateway: an outbound-only exit. It sits in a Public Subnet and pushes traffic out through the IGW. Public Subnet vs Private Subnet is not a property of the Subnet itself — it is the result of where the default route points.\nRoute Table rules ultimately decide internal communication, external egress, and the Public vs Private split. 
IGW and NAT only provide the kinds of exits; which destination ends up at which exit is still determined by the rules.\nThe next article covers how a VPC connects to other networks — the topologies that Peering, Transit Gateway, VPN, and PrivateLink make possible.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/vpc-routing-fundamentals/","summary":"How Route Tables decide traffic paths inside a VPC, the role of Internet Gateway and NAT Gateway as external exits, and the actual meaning of Public/Private Subnet.","title":"VPC Traffic Flow with Route Tables"},{"content":"When you launch a VM in the cloud, it automatically lands inside some network. That network is the VPC. AWS, GCP, and Alibaba all call it VPC; only Azure uses a different name — VNet. The abstraction they refer to is the same — an isolated virtual network with a private IP space sitting on top of public cloud infrastructure.\nBecause the same abstraction wears different names, jumping between vendor docs makes it easy to lose your bearings. Routing, connectivity, and security come in later posts.\nWhy VPC Exists Cloud is fundamentally a multi-tenant environment. Workloads from many customers run side by side on the same physical infrastructure. Without isolation, one customer\u0026rsquo;s traffic could be visible to another, and IP addresses would collide.\nVPC solves this with SDN-based virtual networks. Each VPC has its own IP space and its own route table, and is isolated from other VPCs by default. From the user\u0026rsquo;s perspective, it\u0026rsquo;s like spinning up your own data center on top of the cloud.\nThis isolation comes from three elements combined — IP space, Subnet, and Tenancy.\nIP Space (CIDR) The first thing you decide when creating a VPC is the CIDR block. 
Specify a private IP range like 10.0.0.0/16 and the addresses inside that range will be allocated to resources within the VPC.\nThe recommendation is to stay inside the private ranges defined by RFC 1918 — 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16. These ranges are not routable on the public internet, so they don\u0026rsquo;t collide with external addresses, and other VPCs are free to use the same ranges.\nThe catch comes when you want to connect VPCs later. Two VPCs both on 10.0.0.0/16 will produce ambiguous routes when wired together via Peering or Transit Gateway. Carving up the private space at the organization level from day one is the safer move.\nThe prefix length is also hard to change after the fact, so picking a generous space with future growth in mind pays off.\nWhat IP space contributes to isolation is straightforward: only resources inside the same VPC interpret the same IP space meaningfully, while other VPCs are separated at the IP level from the start.\nSubnet Once the VPC has its IP space, you split that space into smaller blocks. Those blocks are Subnets.\nSubnets are typically created per Availability Zone (AZ). Resources in the same AZ go into one Subnet; resources in a different AZ go into a different Subnet. Since AZs are physically separate data center clusters, splitting by AZ is the foundation of availability — if one AZ fails, Subnets in other AZs are unaffected.\nflowchart LR subgraph VPC [\"VPC (10.0.0.0/16)\"] subgraph AZ_a [\"AZ-a\"] S_a1[\"Public Subnet10.0.1.0/24\"] S_a2[\"Private Subnet10.0.2.0/24\"] end subgraph AZ_b [\"AZ-b\"] S_b1[\"Public Subnet10.0.3.0/24\"] S_b2[\"Private Subnet10.0.4.0/24\"] end end From the isolation perspective, Subnets act less as a separation boundary themselves and more as the smallest unit at which policies apply. Routing rules and security rules attach at the Subnet level, and the Public Subnet vs. 
Private Subnet distinction emerges from those routing rules.\nTenancy Even within the same VPC, whether a resource shares physical hardware with other customers is decided by the Tenancy option. Tenancy can be set as a VPC default or specified per instance.\nThe default is shared tenancy. Multiple customers\u0026rsquo; VMs run together on the same physical host, which keeps cost low and is sufficient for typical workloads.\nChoosing dedicated tenancy means resources in that VPC do not share physical hardware with anyone else. It\u0026rsquo;s typically used when compliance demands are strong — financial regulation, medical data, government workloads. Cost goes up and the available instance types are more limited.\nVendors name the option differently. AWS splits it into dedicated and host; GCP exposes a separate concept called sole-tenant nodes. The pattern is the same regardless of naming — pushing isolation further costs more.\nTenancy adds physical-hardware-level isolation on top of IP-level isolation, raising the strength of isolation by another step.\nVendor Naming Map The core terms covered so far, mapped across vendors:\nConcept AWS GCP Azure Alibaba Cloud Virtual private network VPC VPC Virtual Network (VNet) VPC Subnet Subnet Subnet Subnet VSwitch Private IP range CIDR Block IP range Address space CIDR Block Availability zone Availability Zone Zone Availability Zone Zone Dedicated hardware Dedicated / Host Sole-tenant Dedicated Host Dedicated Host The names that stand out the most are Azure\u0026rsquo;s VNet and Alibaba\u0026rsquo;s VSwitch. The rest stay close to the same vocabulary. Build the mental model in one vendor and at least half of any other vendor\u0026rsquo;s docs will already feel familiar.\nCompared to Pre-Cloud Isolation The idea of an isolated private network predates the cloud. On-prem data centers split traffic with VLANs and used RFC 1918 private ranges; this pattern is decades old.\nVPC reimplements that pattern on top of SDN. 
The dependency on physical-layer concepts like VLAN tags is gone, and abstractions like IP space / Subnet / Tenancy are now operable through APIs and code. Spinning up a new Subnet takes one API call rather than cable work. The substance of isolation is unchanged, but operational flexibility is on a different level.\nWrap-up VPC isolation is the combination of three elements:\nIP space (CIDR): the private IP address space scoped to a VPC. Stay inside RFC 1918 ranges and pick a generous size with future connectivity in mind. Subnet: the block that splits the VPC per Availability Zone. AZ-level separation is the basis of availability, and the unit at which routing and security policies apply. Tenancy: whether physical hardware is shared. Shared is the default; dedicated is for stronger compliance demands. Vendor names differ, but the abstraction is the same — VPC, VNet, and VSwitch are different labels on the same concept.\nIsolation does not come from any single element alone. IP space handles address collisions, Subnet handles availability and policy attachment, Tenancy handles physical hardware. How far to push isolation is decided by what the workload demands.\nThe next article covers traffic flow inside a VPC — the routing structure that Route Tables, Internet Gateways, and NAT Gateways build together.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/vpc-isolation-fundamentals/","summary":"How VPC simulates a private network boundary by combining IP CIDR, Subnet, and Tenancy. Includes vendor naming map across AWS / GCP / Azure / Alibaba.","title":"VPC and the Isolation Model"},{"content":"\u0026ldquo;Which stocks to buy\u0026rdquo; and \u0026ldquo;when to buy and sell\u0026rdquo; are different questions. The indicators covered in previous posts — PER, ROE, momentum, dividend yield — answer the first. They evaluate a company\u0026rsquo;s value and quality. Answering the second requires reading price movements directly. 
Technical indicators serve that purpose.\nTechnical Indicators\nRSI (Relative Strength Index)\nRS = Average Gain / Average Loss\nRSI = 100 - (100 / (1 + RS))\nCompares the magnitude of price gains versus losses over a period (typically 14 days). Produces a value between 0 and 100.\nRSI \u0026lt; 30 indicates an oversold condition. The decline has been large enough that a rebound is possible. Read as a buy signal. RSI \u0026gt; 70 indicates an overbought condition. The rise has been large enough that a decline is possible. Read as a sell signal.\n30/70 are not absolute thresholds. In strong uptrends, RSI can stay above 70 and keep rising. Adjusting to 20/80 or combining with other indicators helps reduce false signals.\nSMA Cross (Moving Average Crossover)\nSMA(N) = Average of closing prices over the last N days\nThe Simple Moving Average (SMA) is the arithmetic mean of closing prices over N days. Crossovers between short-term and long-term SMAs identify trend reversals.\nGolden cross. The short-term SMA crosses above the long-term SMA. Signals an uptrend. Buy signal. Death cross. The opposite. The short-term crosses below the long-term. Signals a downtrend. Sell signal.\nCommon period combinations are 5/20 (short-term), 20/60 (medium-term), and 50/200 (long-term). Shorter periods produce faster signals but also more false ones.\nMACD (Moving Average Convergence Divergence)\nMACD Line = 12-day EMA - 26-day EMA\nSignal Line = 9-day EMA of MACD Line\nHistogram = MACD Line - Signal Line\nDeveloped by Gerald Appel. Uses Exponential Moving Averages (EMA) to capture both trend direction and strength. EMA gives more weight to recent data than SMA.\nWhen the MACD Line crosses above the Signal Line, it is a buy signal. Crosses below, a sell signal. The histogram visualizes the gap between the two lines. 
A transition from positive to negative (or vice versa) also signals a trend change.\nSimilar principle to SMA Cross, but responds more quickly to price changes because of EMA.\nBollinger Bands\nMiddle Band = 20-day SMA\nUpper Band = Middle Band + (2 × Standard Deviation)\nLower Band = Middle Band - (2 × Standard Deviation)\nDeveloped by John Bollinger. Expresses price volatility as bands around a moving average. Since standard deviation is used, bands widen when volatility is high and narrow when low.\nPrice breaks below the lower band: oversold. Buy signal. Breaks above the upper band: overbought. Sell signal. If prices were normally distributed, about 95% of price action would stay within the bands; in practice the share is usually somewhat lower.\nA narrowing band (squeeze) indicates low volatility. A significant price move may follow. Direction is unknown, so enter after confirming the breakout direction.\nComposite Signals\nFour indicators covered. They share a common limitation. Trading on any single indicator is vulnerable to false signals. RSI can drop below 30 and keep falling. A golden cross can quickly reverse into a death cross. Combining multiple indicators reduces these false signals.\nWeighted Combination\nAssign a weight to each indicator. When the sum of triggered weights meets a threshold relative to total weights, fire a composite signal.\nSignal condition: sum of triggered weights / total weights ≥ threshold (e.g., 0.5)\nExample: combining RSI (weight 0.3), SMA Cross (weight 0.3), and MACD (weight 0.4). If RSI and MACD both trigger buy signals, (0.3 + 0.4) / 1.0 = 0.7. Above the 0.5 threshold, so a composite buy signal fires.\nLower threshold means higher sensitivity. Higher threshold means more conservative. This tension is tuned through backtesting.\nCooldown\nPrevents consecutive signals on the same ticker. Once a signal fires, signals for that ticker are ignored for a set period (e.g., 300 seconds).\nWithout cooldown, high-volatility periods generate repeated buy-sell signals. 
Unnecessary transaction costs accumulate.\nThree Strategy Types The technical indicators covered here belong to one of several strategy types in quantitative investing. The factor indicators from previous posts belong to another. Each type operates on a different time horizon.\nStrategy Type Decision Basis Time Horizon Stock Selection Signal (Technical) RSI, MACD, and other technical indicators Seconds/minutes (real-time) Fixed (user-specified) Factor (Screening) PER, ROE, and other factor scores Monthly/quarterly (rebalancing) Dynamic (rotated each rebalancing) Asset Allocation Drift from target weights When weight band is breached Fixed (asset class level) Signal strategies receive real-time price data and determine buy/sell decisions using technical indicators. RSI, SMA Cross, MACD, and Bollinger Bands covered in this post belong here. Operates on the shortest time horizon.\nFactor strategies score stocks using factors from previous posts (PER, ROE, momentum, etc.). Select the top N stocks at each rebalancing. Monthly or quarterly cycles.\nAsset Allocation strategies set target weights per holding. Rebalance when current weights drift outside a band. Holdings remain fixed unless the target weights change.\nThe three are not mutually exclusive. Select stocks via Factor, manage asset class weights via Asset Allocation, time entries and exits via Signal.\nTechnical indicators are tools for reading buy/sell signals from price movements. But knowing the tools and designing a strategy are different things. Choosing which indicators to combine and which strategy type to use — that is the design work.\nThis series started from stock data basics, progressed through company evaluation indicators, market trend indicators, backtesting, portfolio construction, and price-based technical indicators. 
The next step is applying these to real data — building and validating strategies firsthand.\nReferences Investopedia — RSI Investopedia — Moving Average (SMA) Investopedia — MACD Investopedia — Bollinger Bands J. Welles Wilder Jr., New Concepts in Technical Trading Systems (1978) Gerald Appel, Technical Analysis: Power Tools for Active Investors (2005) John Bollinger, Bollinger on Bollinger Bands (2001) ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/signal-technical-indicators/","summary":"Covers technical indicators (RSI, SMA Cross, MACD, Bollinger Bands) for reading price movements, composite signal design, and a comparison of three strategy types across different time horizons.","title":"Technical Indicators and Trading Signals"},{"content":"Previous posts covered individual indicators: PER, ROE, momentum, dividend yield. The question now is \u0026ldquo;how to combine these indicators into a single score, and how to construct and manage a portfolio from that score.\u0026rdquo;\nFactor Investing Factor investing systematically selects stocks based on specific characteristics (factors). \u0026ldquo;Buy low-PER stocks\u0026rdquo; is a form of factor investing — selecting stocks based on the PER factor.\nSingle Factor vs Multi-Factor Single-factor strategies select stocks using one indicator. Low-PER strategies, high-ROE strategies, and high-momentum strategies fall into this category. They are intuitive but limited. As covered earlier, low-PER stocks can be value traps, and high-momentum stocks may already be overvalued.\nMulti-factor strategies combine multiple indicators into a composite score. By reflecting multiple perspectives — value, quality, momentum, dividend — simultaneously, they compensate for each individual factor\u0026rsquo;s weaknesses. 
Both academic research and practice consistently show multi-factor approaches producing more stable results than single-factor ones.\nWhen constructing a multi-factor strategy, assign weights to each factor. Equal weighting (same weight for all factors) is the simplest approach. To emphasize a particular factor, increase its weight. There is no single correct weighting — validation through backtesting is required.\nFactor Scoring Combining multiple factors into one score requires converting each factor\u0026rsquo;s values into comparable units. PER operates in the 10s, ROE in percentages like 15%, and momentum in decimals like 0.3. Adding values with different scales directly causes certain factors to dominate the score.\nZ-Score Z = (Value - Mean) / Standard Deviation Z-Score indicates how many standard deviations each value is from the mean. It transforms all factors to the same scale with mean 0 and standard deviation 1.\nZ-Score\u0026rsquo;s advantage is preserving distribution information. It reflects the difference between exceptionally good stocks and average ones. The disadvantage is sensitivity to outliers. A single stock with PER of 1,000 compresses all other stocks\u0026rsquo; Z-Scores near zero.\nRank Rank-based scoring uses rankings instead of raw values. Among 100 stocks, the one with the lowest PER ranks 1st, the highest ranks 100th. Rankings are normalized to a 0-1 range for scoring.\nRank\u0026rsquo;s advantage is robustness to outliers. A stock with PER of 1,000 simply receives the last rank without affecting other stocks\u0026rsquo; scores. In practice, Rank is used more frequently than Z-Score.\nlower_is_better Handling PER, PBR, and debt ratio are better when lower. Momentum, ROE, and dividend yield are better when higher. When combining factors with different directions into a single score, \u0026ldquo;lower is better\u0026rdquo; factors must have their signs inverted or rankings reversed. 
Missing this step causes scores to work opposite to intent.\nRebalancing Rebalancing restores a portfolio to its target allocation after it drifts. Over time, differing returns across holdings cause the actual allocation to diverge from the initial target. Periodically reverting to the original allocation is rebalancing.\nRebalancing Frequency Frequency Characteristics Daily Most precise but highest transaction costs Weekly Balance between cost and precision Monthly Most common choice Quarterly Lowest cost but larger allocation drift Rebalancing too frequently accumulates transaction costs. Rebalancing too rarely allows significant drift from intended allocations. Monthly rebalancing is the most widely adopted compromise.\nRebalancing Bands Trading at every rebalancing date regardless of drift is inefficient. Rebalancing bands set a rule: \u0026ldquo;trade only when allocation deviates from target by more than a threshold.\u0026rdquo; Small deviations are left alone; trades occur only when the threshold is crossed. This reduces unnecessary transactions and saves costs.\nAsset Allocation Asset allocation distributes capital across multiple asset types. Beyond stocks, it includes bonds, gold, commodities, and other assets.\nThe Principle of Diversification \u0026ldquo;Don\u0026rsquo;t put all your eggs in one basket\u0026rdquo; captures the essence of asset allocation. When stocks decline, bonds may rise. During inflation, gold may appreciate. The lower the correlation between assets, the greater the diversification benefit.\nThe goal of asset allocation is not maximizing returns. It is maintaining adequate returns while reducing risk. A 60% stocks + 40% bonds portfolio yields less than 100% stocks, but MDD decreases significantly.\nTraditional Allocations Allocation Characteristics 60% Stocks + 40% Bonds The most traditional balanced portfolio 80% Stocks + 20% Bonds Growth-oriented. 
Sometimes recommended for younger investors 40% Stocks + 40% Bonds + 20% Gold/Commodities All-weather style multi-asset allocation No single allocation ratio is universally correct. It depends on the investor\u0026rsquo;s goals, time horizon, and risk tolerance.\nSo far, this series covered indicators for evaluating companies (PER, PBR, ROE, debt ratio), reading market trends (momentum, dividends), validating strategies (backtesting), and constructing portfolios (factor scoring, rebalancing, asset allocation).\nThe next post shifts perspective to technical indicators — reading buy/sell timing from price movements themselves.\nReferences Investopedia — Factor Investing Investopedia — Rebalancing Investopedia — Asset Allocation Investopedia — Z-Score Andrew Ang, Asset Management: A Systematic Approach to Factor Investing (2014) ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/portfolio-factor-scoring/","summary":"Covers factor scoring methods (Z-Score, Rank) for combining indicators into a single score, plus the basics of rebalancing and asset allocation.","title":"Portfolio Management and Factor Scoring"},{"content":"Once a strategy is built, it must be validated against historical data. This is backtesting. But does a high CAGR make a good strategy? What about CAGR of 30% with MDD of -50%? A single metric can mislead.\nReturn Metrics CAGR CAGR (Compound Annual Growth Rate) annualizes the total return over the investment period.\nCAGR = (Final Value / Initial Value)^(1/Years) - 1 A total return of 33% over 3 years translates to a CAGR of about 10%. It enables comparison across strategies with different time horizons.\nCAGR only tells you the magnitude of returns, not how volatile the path was. Two strategies with the same CAGR may have followed vastly different trajectories — one steady, the other crashing -40% before recovering.\nAlpha Alpha = Strategy Return - Benchmark Return Alpha measures excess return over the market (benchmark). 
Common benchmarks include the KOSPI index or S\u0026amp;P 500. Positive Alpha means the strategy outperformed the market.\nThe goal of a quant strategy is to generate positive Alpha. Market returns can be captured by simply buying an index fund. A strategy\u0026rsquo;s value lies in the Alpha it adds on top.\nRisk Metrics\nMDD\nMDD (Maximum Drawdown) is the largest peak-to-trough decline during the strategy\u0026rsquo;s operation.\nMDD = (Trough - Peak) / Peak × 100%\nIt answers \u0026ldquo;how much could you lose in the worst case?\u0026rdquo; An MDD of -30% means assets fell from a peak of $1 million to $700,000 at some point.\nMDD matters because of psychological limits. Even with a high CAGR, a strategy with -50% MDD is difficult to maintain in practice. Few investors can tolerate watching their assets halve.\nSharpe Ratio\nSharpe Ratio = (Strategy Return - Risk-Free Rate) / Standard Deviation of Returns\nSharpe Ratio measures return efficiency relative to risk. For the same return, lower volatility yields a higher Sharpe.\nSharpe Ratio: Interpretation\n\u0026lt; 0: Worse than risk-free\n0 – 1.0: Average\n1.0 – 2.0: Good\n\u0026gt; 2.0: Excellent\nConsider a strategy with CAGR 30% and MDD -50% versus one with CAGR 15% and MDD -15%. The latter likely has a higher Sharpe Ratio. Returns are half, but risk is far lower. Return magnitude and return efficiency are different concepts.\nOperational Metrics\nWin Rate\nWin rate is the proportion of profitable trades. A 60% win rate means 6 profitable trades out of 10.\nA high win rate does not guarantee good performance if the losses are large. Conversely, a 30% win rate can still produce strong results if winning trades are large enough. The combination of win rate × average gain/loss ratio matters more than win rate alone.\nTurnover\nTurnover measures how frequently the portfolio is replaced.\nHigh turnover means high transaction costs. 
Each buy/sell incurs commissions, and slippage (the gap between expected and actual execution price) accumulates. Backtests that ignore transaction costs overstate the performance of high-turnover strategies.\nBacktesting Pitfalls Strong backtest results do not guarantee real-world success.\nLook-ahead Bias Using future data at the current point in time. For example, quarterly earnings for March 31 are reported weeks later. Using March 31 data on March 31 means using information that was not yet available. Momentum scores must use only data available as of the rebalancing date.\nSurvivorship Bias Distortion from excluding delisted stocks from the dataset. Backtesting only on surviving stocks inflates performance. In reality, you might have invested in a stock that was later delisted, resulting in losses. This bias is directly tied to data source limitations.\nOverfitting Building a strategy that fits historical data perfectly. Excessive parameter tuning produces flawless past performance but poor future results. Minimizing parameter count and validating on out-of-sample data are the standard countermeasures.\nIgnoring Transaction Costs Backtests without commissions and slippage show better results than reality. The gap is especially large for high-turnover strategies. Setting a realistic fee rate during backtesting is essential.\nIgnoring Liquidity Small-cap stocks with low trading volume appear tradeable in backtests but may not execute at desired prices in practice. Filtering out low-liquidity stocks using a minimum market cap threshold is standard practice.\nBacktest performance metrics span three axes: returns (CAGR, Alpha), risk (MDD, Sharpe Ratio), and operations (Win Rate, Turnover). A single metric distorts judgment. 
All three axes must be examined together to assess a strategy\u0026rsquo;s true value.\nEven strong backtest results may be contaminated by five pitfalls: Look-ahead Bias, Survivorship Bias, Overfitting, ignoring transaction costs, and ignoring liquidity. Reading results and questioning results must go hand in hand.\nThe next post will cover how to combine individual indicators into a single score and construct portfolios — factor scoring and rebalancing.\nReferences Investopedia — Compound Annual Growth Rate (CAGR) Investopedia — Maximum Drawdown (MDD) Investopedia — Sharpe Ratio Investopedia — Overfitting Marcos López de Prado, Advances in Financial Machine Learning (2018) ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/backtest-metrics/","summary":"Covers the formulas and interpretation of CAGR, MDD, Sharpe Ratio, and other backtest metrics, plus five common backtesting pitfalls.","title":"Backtest Performance Metrics"},{"content":"If valuation and quality evaluate \u0026ldquo;the company itself,\u0026rdquo; momentum looks at \u0026ldquo;market trends\u0026rdquo; and dividends look at \u0026ldquo;cash flow.\u0026rdquo; Understanding these two factors broadens the lens for stock selection.\nMomentum Definition Momentum is the return over a past period.\nN-month Momentum = (Current Price - Price N months ago) / Price N months ago The core principle is simple. Stocks that rise tend to keep rising, and stocks that fall tend to keep falling. This is called trend following. It is an academically validated phenomenon. Jegadeesh and Titman showed in their 1993 paper that buying past 3-12 month winners and selling losers generated significant excess returns.\nPeriod Characteristics Momentum behaves differently depending on the measurement period.\n1-month momentum. In the short term, a reversal effect can appear instead of trend following. Stocks that surged may pull back from overbought levels, or crashed stocks may bounce from oversold levels. 
This is why 12-month momentum strategies commonly exclude the most recent month.\n3-6 month momentum. This is where the momentum effect appears most strongly. Trends in this range relate to the speed at which fundamental information — earnings announcements, analyst reports — gets priced in.\n12-month momentum. Captures long-term trends. Typically calculated excluding the most recent month to avoid the short-term reversal effect.\nWhy Momentum Works Behavioral economics explains why momentum exists.\nUnderreaction. Investors do not react immediately to new information. Even after strong earnings, the stock price adjusts gradually rather than all at once. This gradual adjustment creates trends.\nHerding. As prices rise, more investors join the buying. This positive feedback loop reinforces the trend.\nConfirmation bias. Investors weigh information that supports their existing views more heavily. In uptrends, they react more to positive news; in downtrends, more to negative news.\nCaution The most critical consideration in momentum calculation is preventing Look-ahead Bias. Momentum scores must be calculated as of the rebalancing date. Using future data at the current point produces backtest results better than reality.\nDividend Dividend Yield Dividend Yield = Annual Dividend / Current Price × 100% Dividend yield shows how much dividend income you receive relative to the stock price. A 5% yield on a $100 stock means $5 in annual dividends.\nDividends provide cash flow independent of price appreciation. Even if the stock price does not rise, dividends alone generate returns.\nCharacteristics of High-Dividend Stocks High-dividend stocks are typically mature companies. Their businesses have stabilized, and they return a significant portion of profits to shareholders. They are common in utilities, telecommunications, and financial sectors.\nThe advantage is stable cash flow. Even in market downturns, dividend income partially offsets losses. 
The disadvantage is potentially lower growth. Money paid as dividends is not reinvested in the business.\nA high dividend yield is not always positive. If the stock price drops sharply, the dividend yield rises mechanically. In such cases, a high yield may reflect a distress signal rather than generosity. Dividend sustainability should always be verified.\nRelationships Between Factors Momentum and dividends, together with valuation and quality, form the building blocks of multi-factor strategies. Interesting relationships exist among them.\nMomentum and value often point in opposite directions. Value stocks (low PER/PBR) may be stocks whose prices have fallen. Fallen stocks have low momentum. Conversely, high-momentum stocks may have risen to the point where PER/PBR is elevated.\nDividends and quality correlate. Sustaining high dividend payments requires consistent earnings. Companies with high ROE and low debt ratios are more likely to maintain stable dividends.\nThese inter-factor correlations are why multi-factor strategies — combining several factors — tend to produce more stable results than single-factor approaches.\nMomentum follows market trends. Dividends verify cash flow. 
Because they offer different perspectives from valuation and quality, combining them diversifies stock selection.\nThe next post will cover how to measure strategy performance — the key metrics of backtesting.\nReferences Jegadeesh \u0026amp; Titman (1993), \u0026ldquo;Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency\u0026rdquo; Investopedia — Momentum Investing Investopedia — Dividend Yield AQR — Fact, Fiction and Momentum Investing ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/momentum-dividend/","summary":"Covers the role of price momentum and dividend yield as factors in quant investing.","title":"Momentum and Dividend"},{"content":"A stock being \u0026ldquo;cheap\u0026rdquo; and being \u0026ldquo;good\u0026rdquo; are different things. A stock with a low PER may look cheap, but the company\u0026rsquo;s earnings could be declining. A stock with a high ROE may look good, but it could already be overvalued. Evaluating a company properly requires looking at both valuation (cheap or expensive?) and quality (earns well or not?) together.\nValuation Indicators Valuation indicators answer \u0026ldquo;is this stock cheap or expensive at its current price?\u0026rdquo; In general, lower means undervalued, higher means overvalued.\nPER PER (Price-to-Earnings Ratio) divides the stock price by earnings per share (EPS).\nPER = Price / Earnings Per Share (EPS) A PER of 10 means \u0026ldquo;assuming current earnings continue, it takes 10 years to recoup the investment.\u0026rdquo; It is intuitive and the most widely used valuation indicator.\nA low PER suggests the stock is undervalued relative to its earnings. However, there are caveats.\nNegative PER for loss-making companies. If earnings are negative, PER is also negative. A negative PER is meaningless as a comparison metric. PSR is typically used instead for loss-making companies.\nIndustry differences. 
IT companies often have average PERs of 20-30x, while banks and utilities typically range from 5-10x. Industries with high growth expectations reflect future earnings, resulting in higher PERs. The same PER carries different implications depending on the industry.\nPBR PBR (Price-to-Book Ratio) divides the stock price by book value per share (BPS).\nPBR = Price / Book Value Per Share (BPS) PBR evaluates the stock price relative to the company\u0026rsquo;s asset value. A PBR of 1 means the stock price equals net asset value. A PBR below 1 implies that even liquidating the company would yield more than the current stock price.\nBenjamin Graham viewed low-PBR stocks as investments with a \u0026ldquo;margin of safety.\u0026rdquo; It remains a core value investing indicator. However, a low PBR is not always positive. When asset values are overstated (distressed assets, inadequate depreciation), a low PBR may not reflect genuine safety margin.\nPSR PSR (Price-to-Sales Ratio) divides the stock price by sales per share.\nPSR = Price / Sales Per Share PSR\u0026rsquo;s advantage is that it applies to loss-making companies. Revenue is almost always positive. It is useful for valuing early-stage growth companies that have not yet turned a profit.\nA low PSR suggests the stock is undervalued relative to revenue. However, PSR does not reflect profit margins. If a company generates high revenue but no profit, PSR alone is insufficient for evaluation.\nQuality Indicators Quality indicators answer \u0026ldquo;does this company earn well, and is it financially safe?\u0026rdquo; Unlike valuation indicators, higher is generally better (except for debt ratio).\nROE ROE (Return on Equity) divides net income by shareholders\u0026rsquo; equity.\nROE = Net Income / Shareholders\u0026#39; Equity × 100% It shows how much the company earned with the money shareholders invested. 
An ROE of 15% means the company generated 15 in net income for every 100 in equity.\nWarren Buffett reportedly considers companies that consistently maintain ROE above 15% as high-quality businesses. Higher ROE indicates better profitability relative to shareholder capital.\nHowever, a high ROE may result from debt leverage. If equity is low and debt is high, the denominator shrinks, inflating ROE. To distinguish this, examine ROA alongside ROE.\nROA ROA (Return on Assets) divides net income by total assets.\nROA = Net Income / Total Assets × 100% It shows how efficiently the company utilizes all assets (equity + debt).\nComparing ROE and ROA reveals the debt leverage effect. High ROE with low ROA indicates returns driven by borrowed money. High ROE with high ROA indicates genuine asset efficiency.\nHigh ROE + High ROA → Efficient, high-quality company High ROE + Low ROA → Reliant on debt leverage Low ROE → Low profitability overall Debt Ratio The debt ratio divides total liabilities by shareholders\u0026rsquo; equity.\nDebt Ratio = Total Liabilities / Shareholders\u0026#39; Equity × 100% Lower is safer. A debt ratio of 100% means liabilities equal equity. Above 200% is generally considered a warning sign.\nDebt itself is not inherently bad. Appropriate leverage supports business expansion. The problem arises when debt becomes too large to service, or when economic downturns amplify repayment pressure. Quant strategies use debt ratio as a safety factor to filter out excessively leveraged stocks.\nCombining the Two Axes Viewing valuation and quality indicators together matters more than looking at each in isolation.\nLow PER + High ROE. Cheap relative to earnings and highly profitable. This is the ideal combination for an undervalued quality stock. In practice, such opportunities rarely persist because the market reprices quickly.\nLow PER + Low ROE. Cheap but poor earnings. \u0026ldquo;Cheap for a reason.\u0026rdquo; This is called a value trap. 
Relying on PER alone can lead to this pitfall.\nHigh PER + High ROE. Expensive but highly profitable. This combination frequently appears in growth stocks. The question becomes whether the premium reflects genuine future growth or overvaluation.\nA single indicator can mislead. Combining valuation and quality is the starting point of a multi-factor strategy.\nValuation indicators (PER, PBR, PSR) judge \u0026ldquo;is the stock cheap?\u0026rdquo; Quality indicators (ROE, ROA, debt ratio) judge \u0026ldquo;does the company earn well and is it safe?\u0026rdquo; Evaluating both axes together is essential for proper stock assessment.\nThe next post will cover momentum and dividend indicators — factors that look at market trends and cash flow rather than the company itself.\nReferences Investopedia — Price-to-Earnings Ratio (P/E Ratio) Investopedia — Price-to-Book Ratio (P/B Ratio) Investopedia — Price-to-Sales Ratio (P/S Ratio) Investopedia — Return on Equity (ROE) Investopedia — Return on Assets (ROA) Investopedia — Debt-to-Equity Ratio Benjamin Graham, The Intelligent Investor (1949) ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/valuation-quality-indicators/","summary":"Covers how to judge whether a company is cheap (valuation) and whether it earns well (quality) — formulas and interpretation of PER, PBR, PSR, ROE, ROA, and debt ratio.","title":"Valuation and Quality Indicators"},{"content":"Quant investing starts with data. Whether analyzing stocks or backtesting strategies, the first thing you encounter is price data. 
The foundational terms that underpin all subsequent indicators and strategies are OHLCV, returns, and market capitalization.\nOHLCV OHLCV summarizes a day\u0026rsquo;s price movement in five numbers.\nAbbreviation Meaning Description O (Open) Opening price Price at market open H (High) High price Highest price during the day L (Low) Low price Lowest price during the day C (Close) Closing price Price at market close V (Volume) Trading volume Number of shares traded during the day Most quant analysis uses the closing price as its basis. The close represents the final consensus price of the day and serves as the default input for return calculations, moving averages, and technical indicators. The opening price may differ from the previous close — this difference is called a gap, caused by overnight news or foreign market movements.\nVolume indicates the reliability of a price movement. A price increase on low volume may be a temporary fluctuation driven by few trades. A price increase accompanied by high volume reflects broader market consensus.\nOHLCV data can be collected from sources like Yahoo Finance, brokerage APIs, and exchange data systems.\nReturns Returns are the most fundamental measure of investment performance. The same return can be calculated in two ways.\nSimple Return Simple Return = (Today\u0026#39;s Close - Yesterday\u0026#39;s Close) / Yesterday\u0026#39;s Close This is intuitive. If a stock was $100 yesterday and $105 today, the return is 5%. For a single day, this is accurate.\nThe problem arises when aggregating returns over multiple periods. Suppose day 1 returns +10% and day 2 returns -10%. Adding them gives 0%, but the actual result differs.\n$10,000 × 1.10 = $11,000 (Day 1) $11,000 × 0.90 = $9,900 (Day 2) The actual result is -1%. The arithmetic sum of simple returns (+10% + (-10%) = 0%) does not match reality. 
Log returns solve this problem.\nLog Return Log Return = ln(Today\u0026#39;s Close / Yesterday\u0026#39;s Close) This uses the natural logarithm. The key advantage of log returns is that they are additive over time.\nDay 1 log return: ln(11,000 / 10,000) = 0.0953 Day 2 log return: ln(9,900 / 11,000) = -0.1054 Sum: 0.0953 + (-0.1054) = -0.0101 Converting this sum back to an actual return: e^(-0.0101) - 1 ≈ -1.00%, which matches the real result.\nQuant analysis favors log returns. They simplify multi-period aggregation and statistical analysis (normal distribution assumptions). However, when daily returns are small (within ±5%), the difference between the two methods is negligible. The difference becomes meaningful during high-volatility periods or when working with long-term cumulative returns.\nMarket Capitalization Market Cap = Current Price × Shares Outstanding Market capitalization represents the total value the market assigns to a company. Stock price alone cannot compare company sizes. A $50 stock can have a larger market cap than a $5,000 stock if it has far more shares outstanding.\nSize Classification Market cap classifies stocks by size. In the Korean market, the Korea Exchange (KRX) categorizes KOSPI stocks into large-cap, mid-cap, and small-cap groups by market-cap rank. Common classification criteria:\nCategory Approximate Criteria Large-cap Top ~100 by market cap Mid-cap Below large-cap, top ~300 Small-cap Below mid-cap In the US market, large-cap generally means market cap above $10 billion, and small-cap below $2 billion.\nRole in Screening In quant strategies, market capitalization serves as the first filter for stock selection. Small-cap stocks often have low trading volume, making actual buying and selling difficult. Their price volatility is high. They may show strong backtest results but prove difficult to replicate in practice. 
Setting a minimum market cap threshold to exclude small-caps is standard practice.\nOHLCV is the basic unit of price data. Returns are the language of performance measurement. Market capitalization is the standard for judging company size. These three concepts form the foundation for understanding valuation, quality, and momentum indicators in subsequent posts.\nThe next post will cover valuation and quality indicators — the numbers used to judge whether a company is cheap and whether it earns well.\nReferences Investopedia — Open-High-Low-Close (OHLC) Investopedia — Rate of Return Investopedia — Market Capitalization Investopedia — Log-Normal Distribution and Logarithmic Returns KRX Information Data System ","permalink":"https://wid-blog.github.io/en/posts/daily/investment/stock-data-basics/","summary":"Covers OHLCV data as the starting point of quant investing, the difference between simple and log returns, and the meaning of market capitalization.","title":"Stock Data Basics"},{"content":"Changing data formats in a running service happens regularly. Column encryption, type changes, JSON schema updates, normalization or denormalization. A big-bang approach that stops the service for a single cutover is risky. If something goes wrong during the transition, the entire service goes down.\nCombining dual write with fallback read enables format transitions without service interruption. The key is maintaining a rollback-safe state at each step.\nDual Write + Fallback Read Dual write stores data in both the old format and the new format. During the transition period, the same data coexists in both places.\nFallback read uses the new format if a value exists; otherwise it falls back to the old format. 
Data not yet converted to the new format is still readable.\nflowchart TD W[\"Write\"] --\u003e W1[\"Store in old format\"] W --\u003e W2[\"Store in new format\"] R[\"Read\"] --\u003e C{\"New format has value?\"} C --\u003e|\"Yes\"| D[\"Use new format\"] C --\u003e|\"No\"| E[\"Use old format\"] Combining these two creates a transition period where old and new data coexist.\nThree-Step Process The transition splits into three steps. Each step proceeds only after the previous one has been deployed.\nflowchart LR S1[\"Step 1: Preparation / Add new format / Start dual write / Apply fallback read\"] S2[\"Step 2: Migration / Convert existing data to new format / dry-run → execute\"] S3[\"Step 3: Cleanup / Remove fallback / Drop old format\"] S1 -- \"Deployed\" --\u003e S2 -- \"Verified\" --\u003e S3 Step 1: Preparation Add the new format and modify the code.\nSchema change: Add new columns or fields. Start as nullable since existing data does not have the new format yet. Dual write: On INSERT and UPDATE, write values to both old and new formats. Fallback read: On SELECT, use the new format if a value exists; otherwise return the old format\u0026rsquo;s value. Once deployed, new data is stored in both places. Existing data remains only in the old format, handled by fallback read.\nRollback: Ignore the new format and everything works as before. Just revert the code change.\nStep 2: Migration Batch-convert existing data to the new format.\nWrite and run a batch script. Find rows where the new format is empty, convert the old format value, and store it in the new format.\nRun dry-run first. Check target row count and estimated duration. For large datasets, adjust batch size to manage DB load.\nAfter execution, verify. Confirm that new format values match old format values. Cross-check total row counts. Verification often takes more time than the migration itself.\nRollback: The fallback read from Step 1 is still active. 
Any issues with the new format automatically fall back to the old format.\nStep 3: Cleanup After migration is complete and verified, remove the old format and fallback logic.\nData verification: Confirm no nulls or empty values in the new format once more. Check that data inserted after Step 2 is also in the new format. Code cleanup: Remove dual writes and fallback branches. Consolidate to use only the new format. Schema cleanup: Drop old format columns or fields. No rollback: Dropping the old format deletes the original data. This is why thorough verification is essential.\nUse Cases This pattern is not limited to DB column encryption. Any situation where data format changes and the service cannot stop follows the same structure.\nColumn encryption. Add an encrypted column next to the plaintext column, dual-write to both, batch-encrypt existing plaintext, then drop the plaintext column.\nColumn type change. varchar(100) → text, int → bigint. Add the new-type column, transition via dual write + fallback, then drop the old column.\nJSON schema change. When renaming keys or restructuring a JSON column, create a transition period that supports both old and new structures simultaneously.\nNormalization / denormalization. When adding denormalized columns to reduce joins, or splitting data into separate tables for normalization, the dual write + fallback structure applies.\nCost of the Pattern This pattern provides safety, but a cost comes with it.\nCode complexity increases during the transition period. Dual write and fallback branches are added throughout the service code. This code is removed in Step 3, but during the transition it increases review and maintenance burden.\nIf multiple targets need transition, this cost repeats. Ten tables means applying the same pattern ten times. 
Since the structure is identical across repetitions, automation becomes a consideration.\nThe conditions for this pattern are clear: multiple services reference the same data, large data volumes exist, and service downtime is not acceptable. Outside these conditions, scheduling a maintenance window for a single cutover may be simpler.\nReferences Envelope Encryption — Covers the key management structure that pairs with column encryption transitions. ","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/zero-downtime-data-transition/","summary":"A three-step pattern combining dual write and fallback read to transition data formats in live services without downtime.","title":"Zero-Downtime Data Transition Pattern"},{"content":"The simplest way to encrypt sensitive data in a database is to use a single key for everything. But if that key leaks, all data is exposed. Replacing the key means re-encrypting all data.\nEnvelope encryption solves this by separating the \u0026ldquo;key that encrypts keys\u0026rdquo; from the \u0026ldquo;key that encrypts data.\u0026rdquo;\nSymmetric Encryption For use cases like DB column encryption, where every write requires encryption and every read requires decryption, symmetric encryption is the right fit. A single key handles both operations, keeping computational cost low.\nAsymmetric encryption uses a public/private key pair. It serves key exchange and digital signatures well, but carries higher computational cost and more complex key management than symmetric encryption. For frequent encrypt/decrypt cycles in DB columns, it adds unnecessary overhead.\nAES-256-GCM Among symmetric algorithms, AES-256-GCM is a common choice because it satisfies security, integrity, and performance together.\nOn key length, AES-256 uses a 256-bit key that provides strong resistance against brute-force attacks. 
It is also the most widely vetted symmetric algorithm available.\nThe mode of operation, GCM (Galois/Counter Mode), has two advantages compared to other modes like CBC.\nFirst, integrity. GCM generates an authentication tag alongside the ciphertext. Tampering is detected at decryption time. CBC requires a separate HMAC step for this; GCM handles it in a single operation.\nSecond, it uses an IV (initialization vector) to produce different ciphertext even for identical plaintext. The IV is stored alongside the ciphertext and must be used during decryption to recover the original.\nGCM has one important pitfall, though. Encrypting twice with the same key and same IV completely breaks the security. IVs must be unique for every encryption, typically generated by a secure random source or managed via a counter-based scheme.\nCMK/DEK Two-Tier Structure The core of envelope encryption is splitting keys into two layers.\nflowchart LR CMK[\"CMK(Master Key)\"] --\u003e|\"Encrypts/Decrypts DEK\"| DEK[\"DEK(Data Encryption Key)\"] DEK --\u003e|\"Encrypts/Decrypts data\"| DATA[\"Plaintext Data\"] At the top, the CMK (Customer Master Key) is used solely to encrypt DEKs and never touches data directly. It typically resides inside an HSM (hardware security module) and cannot be extracted.\nThe actual data is handled by the DEK (Data Encryption Key), which sits one layer below. Data is encrypted with the plaintext DEK, and then the DEK itself is re-encrypted by the CMK and stored.\nEncryption Flow A DEK is issued once from the key management service. At issuance, both the plaintext DEK and the encrypted DEK are returned together. 
The plaintext DEK encrypts the data immediately, and the encrypted DEK is stored alongside the data in the DB.\nsequenceDiagram participant App as Service participant KMS as Key Management Service participant DB as DB App-\u003e\u003eKMS: Request new DEK (specify CMK) KMS--\u003e\u003eApp: Return plaintext DEK + encrypted DEK App-\u003e\u003eApp: Encrypt data with plaintext DEK + IV App-\u003e\u003eDB: Store ciphertext + IV + encrypted DEK Note over App: Discard plaintext DEK immediately When reading, fetch the ciphertext, IV, and encrypted DEK from the DB, request decryption of the encrypted DEK from the key management service, and decrypt the ciphertext with the returned plaintext DEK and IV. The plaintext DEK is held in memory only for the duration of the operation and then discarded.\nComparison with Single-Key Approach With a single key, a leak exposes all data. With envelope encryption, a leaked DEK only affects the data encrypted by that specific DEK. The CMK resides in an HSM, making its leak far less likely.\nKey Rotation The practical advantage of envelope encryption becomes clear during key rotation.\nWith a single key, replacing it means re-encrypting all data. For large datasets, this is expensive in both time and cost.\nWith envelope encryption, rotating the CMK only requires re-encrypting the DEKs. The data remains encrypted with the same DEKs and does not need re-encryption. Since DEKs are tiny compared to the data, rotation cost is minimal.\nflowchart LR subgraph Before [\"Before Rotation\"] CMK1[\"CMK v1\"] --\u003e DEK_ENC1[\"Encrypted DEK\"] end subgraph After [\"After Rotation\"] CMK2[\"CMK v2\"] --\u003e DEK_ENC2[\"Encrypted DEK (re-encrypted)\"] end DEK_ENC1 -.-\u003e|\"Only DEK re-encrypted / Data unchanged\"| DEK_ENC2 Regular key rotation is essential for defending against insider threats. 
Envelope encryption minimizes the cost of this practice.\nUsing Cloud Key Management Services While envelope encryption can be implemented from scratch, leveraging cloud key management services is the common approach. Selection criteria include:\nHSM-based key storage: Whether CMKs exist only inside hardware, not software. HSM-based storage is designed so that key material cannot be exported in plaintext. Automatic key rotation: Whether the service supports automatic CMK rotation. Manual rotation adds operational burden. Container environment compatibility: Whether keys can be injected into Kubernetes environments without API calls from service code. This affects service code complexity. Separating the service that issues keys from the service that stores encrypted keys is another common configuration. It limits how a single service compromise can propagate to the entire key set, and aligns with the principle of least privilege.\nWhen Envelope Encryption Fits Envelope encryption is effective under these conditions:\nLarge volumes of encrypted data with regular key rotation requirements Multiple services sharing the same encryption keys Compliance requirements for key management audit trails For small-scale, one-time encryption, a single-key approach may suffice. Envelope encryption adds key management complexity, so it should be chosen when the scale and requirements justify that complexity.\nReferences Zero-Downtime Data Transition Pattern — Covers how to transition data formats like column encryption without downtime. ","permalink":"https://wid-blog.github.io/en/posts/tech/security/envelope-encryption/","summary":"How the CMK/DEK two-tier key structure in envelope encryption limits key leak impact and simplifies key rotation.","title":"Envelope Encryption"},{"content":"Late last year, a conversation started about improving the ad Fallback\u0026rsquo;s performance by introducing a CTR prediction model.\nFallback kicks in when the primary ad system decides there\u0026rsquo;s no ad to serve. 
Its purpose is to raise fill rate — the ratio of impressions to ad slots.\nI was a backend engineer. I had no background in AI.\nThe expectation was that wiring the surrounding systems would take more work than the model itself, so the project landed on my plate.\nModel Choice: Logistic Regression The model was Logistic Regression.\nSince the goal was improving ad CTR, we just needed to learn whether a given impression would be clicked — a binary classification problem.\nLR and LightGBM are commonly used in ad platforms. But this was an initial version, and I didn\u0026rsquo;t want to take on complex tuning and operational burden from day one.\nSo I picked the simpler option: LR.\nLanguage and Framework Choice I went with Python and sklearn. For both the training batch and the inference server.\nI initially considered ONNX + Go. A new project felt like a place where I could start with Go. For inference, pushing the model through ONNX would give me framework independence and better performance.\nBut the internal ML operating environment was Python-centric. The reference examples, the shareable code, the deployment patterns — all in Python. When you need advice and reviews, the same language felt like the right call. I set aside the performance angle and chose continuity of operations.\nThe framework choice followed similar logic. I knew ONNX has better inference performance than sklearn, but for a lightweight model like LR, that gain wouldn\u0026rsquo;t move the needle. sklearn felt enough for training and saving, and forcing a heavy pipeline onto a light model seemed like overengineering to me.\nML Lifecycle Architecture I divided the ML Lifecycle into three components.\nTraining batch: Periodically trains the LR model and pushes the trained model to the model store. Model store: Built on MLflow. Keeps versioned copies of models written by the training batch. Inference server: Loads the latest model from the store and serves real-time predictions. 
flowchart LR A[\"Training batch\"] --\u003e|\"① push model ② move champion alias\"| B[\"Model store (MLflow)\"] A --\u003e|\"③ trigger deployment\"| C[\"Inference server\"] B -.-\u003e|\"④ load champion on pod startup\"| C The flow is simple: training batch → model store → inference server. The three components connect only through model files, and the training schedule runs independently from inference.\nInside the Training Batch: The Promotion Gate The training batch wasn\u0026rsquo;t just \u0026ldquo;train → save.\u0026rdquo; Once training finished, the model had to pass through a Promotion Gate — a quality check — before the champion alias would move.\nflowchart LR A[\"Data loading\"] --\u003e B[\"Preprocessing\"] --\u003e C[\"Training\"] --\u003e D[\"Evaluation\"] --\u003e E{\"Promotion Gate\"} E --\u003e|\"PASS\"| F[\"Update champion alias + trigger rollout\"] E --\u003e|\"FAIL\"| G[\"Keep current champion\"] The criteria were simple. If the trained model\u0026rsquo;s evaluation metrics crossed the predefined thresholds, it passed; otherwise, it failed. On pass, the champion alias moved to the new version and a rollout was triggered. On failure, the new model was only logged to the registry while the current champion kept serving traffic.\nThis meant a degraded model couldn\u0026rsquo;t accidentally reach production — without any code changes.\nDeployment For getting new models into the inference server, I used k8s-based rolling deployment.\nMLflow\u0026rsquo;s alias feature lets you tag a model version with a name like \u0026ldquo;champion\u0026rdquo; to point at the current production model. When the training batch passes the Promotion Gate, it moves the champion alias to the new version and triggers a deployment. 
Inference server pods are replaced one at a time, and each new pod loads whichever model has the champion alias on startup before entering the service.\nLooking Back The LR + sklearn + MLflow combination was simple, and it ran light and fast.\nWhat I regret most was choosing Python + sklearn. As features grew, inference cost climbed and the resources required grew with it. If we had gone with ONNX + Go and used multiple cores inside a single process, the same load could probably have been handled with fewer resources. At the time, I judged that continuity of operations was the right call — but the cost of that decision showed up in the operational phase.\nStarting out, my biggest worry was \u0026ldquo;can I do this without an AI background?\u0026rdquo; By the end, I found that what I needed was a bit different. It wasn\u0026rsquo;t ML algorithms or infrastructure expertise — what mattered more was how precisely I understood the domain, and being able to judge which features to combine and how. Reading data and spotting patterns — that analysis skill — turned out to be just as important.\nReferences Revisiting Logistic Regression Choosing a Model Training Framework: sklearn vs ONNX MLflow: Filling the Gap in the ML Lifecycle ","permalink":"https://wid-blog.github.io/en/posts/career/dable/dsp-fallback-ctr-ml-lifecycle/","summary":"Building my first ML Lifecycle — a three-tier architecture for an ad Fallback CTR prediction — as a backend engineer without an AI background. The technical decisions I made, and what I learned through running it.","title":"LR-based ML Lifecycle Retrospective"},{"content":"MLflow handles the boundary between experiment and model in the ML lifecycle. On the experiment side it records \u0026ldquo;what parameters trained what.\u0026rdquo; On the model side it records \u0026ldquo;which version is production right now.\u0026rdquo; That boundary is not the exclusive problem of large ML teams. 
It shows up just as clearly when you\u0026rsquo;re running a single Logistic Regression model.\nML Lifecycle ML projects roughly move through four stages.\nExperiment — you look at the data, try parameters, train models. You log metrics and go back to try again. Model — once you have something worth keeping, you declare \u0026ldquo;this is our model.\u0026rdquo; A version and a lineage attach to it. Deployment — you put that model into the serving environment. Rollout, rollback, traffic shifting all live here. Monitoring — you watch the live model for drift and degradation. Each stage has its own problem. Experiment struggles with remembering what was tried. Model struggles with agreeing on which one is real right now. Deployment struggles with swapping one thing for another. Monitoring struggles with deciding when to retrain.\nMLflow mostly fills the first two. It reaches into deployment and monitoring, but its center of gravity is in experiment and model.\nFile-name Versioning It starts simple. You upload model.pkl to S3 and the inference server reads it. Every time training finishes, you overwrite the file.\nThen you need to roll back. Yesterday\u0026rsquo;s version. But the file has already been overwritten. So you start splitting: model_v2.pkl, model_v3.pkl. Before long you get model_v3_final.pkl. Then model_v3_final_really.pkl.\nThese names don\u0026rsquo;t solve three things.\nLineage — there is no way to trace model_v3_final.pkl back to the code, the data, and the parameters that produced it. You cannot reproduce it even with the same code. Alias — \u0026ldquo;which model is production right now\u0026rdquo; gets managed with a filename convention outside the code. Does the inference server read latest.pkl? Does it take a version via env var? Every decision becomes ad hoc. Reproducibility — a few months later you want to repeat an experiment, but nothing remembers the parameters and the code from that run. 
To fix these you eventually need a metadata layer on top of the \u0026ldquo;filename\u0026rdquo; layer. That is the gap MLflow fills.\nMLflow Components MLflow is four independent components in one package. You pick which ones you use.\nTracking It records a training session as a run. Parameters, metrics, and artifacts (model files, plots, logs) all attach to the run. Multiple runs group under an experiment.\nimport mlflow with mlflow.start_run(): mlflow.log_param(\u0026#34;C\u0026#34;, 0.1) mlflow.log_metric(\u0026#34;val_auc\u0026#34;, 0.782) mlflow.sklearn.log_model(model, \u0026#34;model\u0026#34;) This chunk is the seed of lineage. Months later you can ask \u0026ldquo;val_auc was 0.78, what were the parameters?\u0026rdquo; and have an answer.\nModel Registry If Tracking records how it was trained, Registry records which result you want to declare yours. You promote one of the logged artifacts into a registered model and a version number attaches. v1, v2, v3 stack up automatically.\nOn top of those versions you can attach an alias. The champion alias is a mutable reference that points to a specific version. When a new version passes validation, you move the champion alias. No code changes, no filename rules, just one alias moving. The \u0026ldquo;model that production points at\u0026rdquo; is replaced.\nmlflow.register_model(\u0026#34;runs:/\u0026lt;run-id\u0026gt;/model\u0026#34;, name=\u0026#34;ctr-model\u0026#34;) client = mlflow.MlflowClient() client.set_registered_model_alias(\u0026#34;ctr-model\u0026#34;, \u0026#34;champion\u0026#34;, version=7) Registry clears the entire model_v3_final.pkl problem. Lineage auto-links to runs, aliases replace filename rules, and reproduction is a matter of looking up a run id.\nOne important constraint: using Registry requires a database backend. File storage alone (./mlruns) does not expose the registry API. Even if you want to start light, you have to stand up at least SQLite, or PostgreSQL / MySQL for real use. 
Since MLflow 3.7.0 the default backend switched to SQLite, which lowers the first entry barrier a little.\nModels This piece standardizes what \u0026ldquo;a model file\u0026rdquo; means. Each framework (sklearn, pytorch, xgboost) gets a flavor, and the same model can be saved under multiple flavors. A saved model can be loaded without the original framework code.\nModels is the portability layer between experiment and deployment. Where Tracking and Registry deal with which model is it, Models deals with how is it serialized.\nProjects An MLproject file plus conda or docker config wraps everything so that \u0026ldquo;anyone running it gets the same environment.\u0026rdquo; mlflow run . sets up the environment and runs training.\nThis is the least used of the four. Teams with their own batch execution standard usually don\u0026rsquo;t add an MLproject layer; they keep their own standard.\nLifecycle Mapping flowchart LR E[Experiment] --\u003e|\"Tracking(run, param, metric)\"| M[Model] M --\u003e|\"Registry(version, alias)\"| D[Deployment] M -.-\u003e|\"Models(flavor)\"| D P[Projects] -.-\u003e|\"runtime env\"| E D --\u003e Mo[Monitoring] Tracking: inside the experiment stage Registry: in the model box between experiment and deployment Models: the portability axis from model into deployment Projects: an optional reproducibility layer on experiment Monitoring is not covered by MLflow directly. It\u0026rsquo;s another tool\u0026rsquo;s job. That diagram shows MLflow\u0026rsquo;s scope most concisely. Four pieces, each playing its own role, and your project picks which slots to fill.\nChoosing Tracking and Registry Picture a lightweight LR model running in production. The combination you most often see, out of the four pieces, is two: Tracking and Registry.\nWhy Tracking. The training batch re-runs LR every cycle, and each run has different parameters and validation metrics. You need to trace, later, which run produced which number. 
The records pile up faster than filenames can describe them. Tracking fills exactly that gap.\nWhy Registry. Only the models that pass the validation step should become \u0026ldquo;champion.\u0026rdquo; The inference server loads that champion. If you manage this with filename conventions, the server ends up polling latest.pkl and you get a race where an unvalidated model reaches production before validation finishes. Aliases remove that race. The actor pulling the deploy trigger and the object being deployed are cleanly separated.\nMoving the alias and swapping the inference server pods are two different events. Once the alias moves, a deployment tool (for example, Argo Rollouts) triggers the pod replacement. Rollouts starts new pods; each new pod, on boot, loads the model that champion currently points at and joins the service. MLflow says \u0026ldquo;which one is champion,\u0026rdquo; and the deployment tool handles \u0026ldquo;how to place it into service.\u0026rdquo;\nThis separation is the point. MLflow does not need to do everything. It just needs to fill its boundary.\nComponents Not Used Models format comes along for free when you log models through Tracking. You don\u0026rsquo;t pick it explicitly, but you get its benefits. Registry can return the model as runs:/\u0026lt;id\u0026gt;/model because of this format.\nProjects is often skipped. If a team already has a stable batch execution standard, adding an MLproject layer is duplication. When a batch runs inside a single framework, the reproducibility win from Projects is small.\nServing is also optional. MLflow offers its own serving endpoint (mlflow models serve), but handling lightweight-model inference directly in an existing server with sklearn is often lighter and easier to integrate with existing infrastructure. Delegating the serving layer to MLflow is rarely justified.\nUsing two pieces out of four is not \u0026ldquo;half using\u0026rdquo; MLflow. 
Filling only the boundary you need and leaving the rest to other tools is, if anything, closer to how this tool is meant to be used.\nClosing The word was \u0026ldquo;boundary.\u0026rdquo; That boundary is where meta-information (when, how, with what, which one is real right now) starts piling up faster than filenames can describe it. MLflow is the lightweight metadata layer at that point. How lightweight depends on you.\nIt isn\u0026rsquo;t a tool for large ML teams only. Even running a single LR, the same boundary shows up. When it does, you fill the slots you need.\n","permalink":"https://wid-blog.github.io/en/posts/tech/ml/mlflow/","summary":"Which slot of the ML lifecycle each MLflow component fills, and which pieces a lightweight team can pick.","title":"MLflow and the ML Lifecycle"},{"content":"A Circuit Breaker blocks calls heading toward a failing dependency, so the caller\u0026rsquo;s resources do not stay occupied by a dependency that is failing. When dependency call failures accumulate on the caller\u0026rsquo;s threads and connections, the accumulated failures degrade the caller\u0026rsquo;s own state, and one dependency\u0026rsquo;s outage spreads as a cascade along the call chain.\nThe trip criterion and the recovery method look like separate decisions, but when the two sides do not align, the system oscillates between trip and recovery. A precise trip combined with a simplistic recovery is the canonical example of the cycling anti-pattern.\nState Model A Circuit Breaker has three states.\nClosed: Normal operation. Calls pass through. Open: Tripped. Calls fail immediately (fail-fast). Half-Open: Recovery probe. Only limited calls pass through to test the dependency. Transition triggers sit at three points: Closed → Open (trip criterion), Open → Half-Open (recovery probe criterion), and Half-Open → Closed/Open (recovery verification criterion). 
Every library follows this model.\nTrip Triggers Three criteria commonly drive the Closed → Open transition.\nFailure rate based trips when the failure rate in a sliding window exceeds the threshold. It fits stable traffic where statistical judgment carries weight. The window has to be large enough not to swing with noise.\nLatency or slow-call based trips on the ratio of calls whose response time exceeds the threshold. It addresses the case where the dependency is alive but slow. Responses still come back, so the failure rate stays low, but the caller\u0026rsquo;s resources stay occupied longer and the effect ends up the same. It fits environments where fail-fast matters.\nCount based trips when consecutive failures exceed the threshold. It is the simplest and reacts the fastest. It is the first candidate when traffic is low and statistical judgment is hard, or when the failure event itself is a clearer signal than response time.\nThe trigger is decided by the shape of the failure signal, but the protection that tripping was meant to deliver only becomes complete when the recovery strategy is designed alongside it.\nRecovery Strategies Two criteria commonly drive the Open → Half-Open → Closed transition.\nTimeout based transitions to Half-Open automatically after a set time. It attempts recovery by looking at time alone. It fits dependencies with a self-recovery pattern (temporary GC pressure, short network drops, and the like).\nGradual pass rate lets a subset of calls through in Half-Open and watches the success rate. Above a threshold it returns to Closed; below, it returns to Open. It is a way to verify the recovery. More accurate, more complex to implement, and it allows a small probe load on the dependency.\nPick simplicity and you pick Timeout; pick accuracy and you pick gradual. The two strategies are not mutually exclusive — a hybrid that enters Half-Open on a timeout and then uses the pass rate to decide Closed/Open is common in practice. 
Which one fits is decided inside the pair with the trip trigger.\nPair Matrix The combination of trip trigger and recovery strategy decides a Circuit Breaker\u0026rsquo;s identity. Some combinations fit; some do not.\nTrip Trigger Recovery Strategy Fit Notes Failure rate Gradual pass rate Dominant Both share a statistical frame Latency / slow-call Gradual pass rate Dominant Re-verifies whether latency has cleared Count based Timeout Acceptable Both favor simple, fast reaction Failure rate Timeout Risky Precise trip ↔ simplistic recovery asymmetry → cycling Failure rate × Gradual is dominant because both sides judge statistically and stay consistent. With the trip criterion as the failure rate in a window and the recovery criterion as the success rate of Half-Open passes, entry and recovery share the same statistical frame.\nLatency × Gradual is also dominant. When the dependency is alive but slow, the recovery point has to verify whether it is still slow. Resuming full load on a timeout alone leaves a high chance of returning to Open the moment latency returns. For the same reason, Latency × Timeout also belongs in the risky category.\nCount × Timeout is an acceptable combination. Both sides favor simplicity and quick reaction, so they line up with operational simplicity. For dependencies with a self-recovery pattern in an environment that tolerates short cycles, it is enough.\nFailure rate × Timeout is risky. The trip side decides statistically and cautiously, while the recovery side resumes full load on time alone, without verification. If the dependency has not recovered, the failure rate crosses the threshold again immediately and the breaker returns to Open. The result is meaningless cycling — the breaker oscillates between Closed and Open while the caller pays the fail-fast cost on each cycle. 
This cycling shows up when the precision on the trip side and the precision on the recovery side do not match.\nTool Mapping Real tools have settled on specific combinations of the two.\nTool Default Trip Trigger Default Recovery Why It Settled Resilience4j (Java) Failure rate + slow-call Gradual pass rate Statistical precision for business-unit protection Polly v8 (.NET) Failure rate (FailureRatio) Timeout (BreakDuration) .NET resilience standard integration Istio / Envoy Count based (consecutive 5xx) Timeout (ejection time) The sidecar has no business context Resilience4j defaults to the failure rate + gradual pair because method-level protection in business logic needs precise triggering and verified recovery together. The actual behavior combines both recovery strategies — waitDurationInOpenState triggers the automatic Open → Half-Open transition (Timeout based), and the failure rate of permittedNumberOfCallsInHalfOpenState passes decides Closed/Open (gradual pass rate).\nPolly v8 defaults to the failure rate + timeout pair because, as a standard component of .NET resilience, statistical judgment became the default. The FailureRatio threshold combined with MinimumThroughput filters noise before tripping, and BreakDuration triggers the transition to Half-Open afterward (up to v7 it was based on consecutive failure counts; v8 changed this).\nIstio / Envoy default to count + timeout because in the sidecar environment, consecutive 5xx responses are the clearest failure signal for an external call. The sidecar lacks business context, so it trips on a simple signal instead of a statistical judgment. outlier detection\u0026rsquo;s consecutive_5xx and base_ejection_time expose that pair directly.\nAll three tools have settled on the pair that fits their environment. Choosing a tool becomes a matter of picking the row whose pair fits your own environment, among the ones already settled.\nBulkhead A Circuit Breaker alone cannot block a cascade. 
Even with the breaker tripped on one dependency, if other dependency calls share the same resource pool (threads, connections), the already-occupied resources do not get released. The resource pool can run out before the trip takes effect.\nThe Bulkhead pattern solves this by isolating resources per dependency. Allocating an independent thread pool (or semaphore slot) to each dependency means one dependency\u0026rsquo;s failure does not consume resources used by other dependency calls. Circuit Breaker and Bulkhead are used together as a combined pattern. Applying only one leaves cascade blocking incomplete.\nDecision Order A Circuit Breaker\u0026rsquo;s decision is not a single decision but a paired design. Decide the trip trigger first by the shape of the failure signal (failure rate / latency / count), decide the recovery strategy as a pair with the trip trigger (simplicity → Timeout, accuracy → gradual), and combine with Bulkhead at the end to isolate resources per dependency.\nA Circuit Breaker\u0026rsquo;s trip trigger and recovery strategy cannot be decided separately. If the trip side is designed with precision, the recovery side needs matching verification; if the trip is kept simple, a simple recovery cycle is enough. The weight on both sides has to match for a Circuit Breaker to deliver the protection it was designed for.\nReferences Rate Limiting — The previous post that covers Rate Limit with the same two-decision constraint relationship between protection layer and algorithm. Ad system outage retrospective — shared dependencies and a single point of failure — A cascade case where Circuit Breaker and Bulkhead were both needed. ","permalink":"https://wid-blog.github.io/en/posts/tech/design-pattern/circuit-breaker/","summary":"A Circuit Breaker\u0026rsquo;s trip trigger and recovery strategy must be designed together. 
Trip without recovery cuts the dependency permanently; recovery without a trip basis becomes meaningless cycling.","title":"Circuit Breaker"},{"content":"Rate limiting is the device that keeps healthy instances from exhausting their resources before autoscaling can react. When a traffic spike arrives faster than new instances become ready, it rejects some requests early so the healthy ones do not reach their limits.\nHow to count is the algorithm question. Where to count is the protection layer question. The two look independent, but the protection layer narrows which algorithms are actually available. That is why the layer has to be chosen before the algorithm.\nProtection Layer Three layers are candidates: L4, L7, and Application. Identification precision and algorithm choice widen together as the layer moves inward.\nL4 (Load Balancer) sits the furthest outside. It counts at the TCP connection level, with identification limited to roughly IP. Processing cost is low, but the counting granularity is coarse, so only simple algorithms apply. Identification gets coarser still when clients are behind NAT.\nL7 (Gateway or Sidecar) operates at the HTTP level. Application identifiers like headers, paths, and tokens can serve as counting keys. Counts split by user and by API, so the algorithm choice widens. In a microservice environment, a sidecar (Envoy and the like) is the first candidate.\nThe Application layer sees business context. Different limits per user tier, protecting specific endpoints, splitting by authenticated token type — these decisions become possible. The most precise, and also the heaviest, with the added cost of counters distributed across instances.\nThe three layers form a clear trade-off. Outer layers identify coarsely but cost little, while inner ones grow more precise and more expensive.\nThe reason the layer choice narrows the algorithm candidate space lies in identification precision and counting granularity. 
At L4, where only IP/connection-level identification is possible, a precise algorithm like Sliding Window loses its precision advantage because the identification key is ambiguous. At Application, where authenticated user-level identification is possible, every algorithm operates meaningfully. Identification precision decides the algorithm\u0026rsquo;s utility itself. The algorithm comparison in the next section is a follow-up decision after this layer choice.\nAlgorithms The widely used algorithms are Token Bucket, Fixed Window, Leaky Bucket, and Sliding Window. They split into two groups by whether bursts are allowed.\nBurst-Allowing Group Token Bucket keeps a bucket where tokens refill at a constant rate, and each request consumes a token. Pass if tokens exist, reject otherwise. Quiet periods let tokens accumulate, allowing short bursts in proportion. It fits workloads that are quiet most of the time but spike briefly.\nFixed Window counts only within a time window. Bursts are allowed inside the window; the count resets at the window boundary. The simplest to implement, with one weakness — bursts are possible across the boundary (requests piled at the end of one window and the start of the next both pass through).\nBurst-Removing Group Leaky Bucket is the queue analogy where requests leak out at a constant rate. Even if input arrives in bursts, the output rate stays flat. When downstream can only handle a fixed rate and you must feed it at that pace, it becomes the first candidate. Calls sent to an external payment gateway are a typical case.\nSliding Window slides the time window as it counts. With no boundary concept, the boundary-burst problem of Fixed Window disappears. Precision is highest, but it must store each request timestamp separately, making memory and computation the heaviest.\nThe choice between the two groups depends on whether downstream can absorb bursts. If it can, the burst-allowing group buys operational simplicity. 
If it cannot, the burst-removing group guarantees flattening.\nTool Mapping Not every layer-algorithm combination is possible in practice. Real tools have settled on certain combinations.\nLayer Representative Tool Natural Algorithm Notes L4 Nginx (limit_req) Leaky Bucket Fits connection-level processing L7 Sidecar Istio / Envoy Token Bucket HTTP header-based identification Application Resilience4j (Java) Cycle based Business context aware Application Bucket4j (Java) Token Bucket Distributed backend support Nginx\u0026rsquo;s limit_req module runs as a Leaky Bucket. The structure of accepting connections and forwarding them at a fixed rate corresponds directly to Leaky\u0026rsquo;s output flattening. The burst option also allows absorption of short input bursts.\nIstio / Envoy\u0026rsquo;s rate limit filter defaults to Token Bucket because allowing per-client bursts identified by HTTP headers is a common requirement in gateway environments. The sidecar itself provides both local mode (within a single instance) and global mode (delegating to an external RLS server).\nResilience4j\u0026rsquo;s RateLimiter module runs on cycle-based counting. Every limitRefreshPeriod, limitForPeriod permissions are reset — unlike Token Bucket where tokens accumulate, here the count refreshes in a single step at each cycle boundary. It fits scenarios where simple cycle counting is enough for method-level protection, and ships as part of the same component set as Circuit Breaker, Retry, and others.\nBucket4j is a Token Bucket-dedicated library. It supports sharing counters across distributed environments through backends like Redis, making it a candidate when cluster-wide protection is needed instead of single-JVM protection.\nCombinations that have settled in practice: Leaky dominates at L4; Token dominates in L7 sidecars and distributed Application protection (Bucket4j). Cycle-based tools like Resilience4j fit single-JVM scenarios where simple counting is enough. 
Sliding Window is missing from the table because tools rarely default to it — it tends to be custom built or layered on top of a distributed counter.\nDecision Order A single flow emerges from the layout above. Traffic shape does not decide algorithms in isolation; the layer narrows them first, then traffic shape narrows them further within what the layer allows.\nNeed protection based on business context → Application layer → Token Bucket HTTP-level identification is enough → L7 → Token Bucket (local or global) Connection-level protection is enough → L4 → Leaky Bucket On top of that, whether bursts are allowed becomes the final refinement for the algorithm.\nWhen placing protection in front of a dependency that needs it, starting from the algorithm comparison narrows the tool space before you even choose the algorithm. The layer decision has to come first for the algorithm\u0026rsquo;s candidate space to open up.\nReferences Ad system outage retrospective — shared dependencies and a single point of failure — A real case of how rate limiting could have blocked the starting point of a cascade. ","permalink":"https://wid-blog.github.io/en/posts/tech/design-pattern/rate-limiting/","summary":"Before choosing a rate limit algorithm, the protection layer decides which algorithms are even available. This post lays out how L4/L7/Application layers and Token/Leaky/Sliding/Fixed algorithms intersect.","title":"Rate Limiting"},{"content":"An external event pushed ad traffic far above normal levels.\nAutoscaling tuned for baseline load couldn\u0026rsquo;t keep up. New PODs hadn\u0026rsquo;t become Ready before existing healthy PODs collapsed, and the shock propagated to the next system.\nThe immediate read was straightforward: traffic spike plus scaling lag. But once we unpacked the retrospective, the real problem sat elsewhere. 
Only the filtering component went down on its own, yet the primary and the fallback both went down with it.\nHow the Cascade Unfolded The filtering component gave way first. Traffic grew faster than HPA could spin up new PODs. CPU on existing healthy PODs crossed the limit before new replicas were Ready, and Readiness Failed events stacked up. A scale-out was in progress while the existing PODs fell apart in parallel: the first link in the cascade.\nThe primary ad system was next. It calls the filtering component when picking ad candidates. As the filtering component degraded, those calls piled up as TIMEOUTs, and the accumulated failures eventually dragged the primary ad system\u0026rsquo;s own state into a bad place. It didn\u0026rsquo;t recover on its own; a manual restart was required.\nThe inference component came third. Once the primary ad system recovered, the requests that had been blocked released all at once. Load that had dropped sprang back to normal in a single jump, and 5xx responses appeared while HPA caught up.\nThe cascade kept extending even as each upstream piece recovered. The filtering component recovered while the primary ad system was still stuck. The primary ad system recovered, and the inference server staggered. Read as a timeline, it looked like three separate incidents. The underlying flow was one.\nDiagnosis — Shared Dependency and a Single Point of Failure Two surface causes stand out. The sudden load created by an external event. And the autoscaling that couldn\u0026rsquo;t match its pace.\nStopping there points the follow-up work toward \u0026ldquo;faster HPA, faster alerts, faster manual response.\u0026rdquo; All valid. All answers to the symptoms.\nPushing the retrospective one step further surfaced a different picture. Our ad system has a filtering component, and the primary ad system depends on it. We also kept a fallback system for failover. 
It sits idle most of the time and activates only when the primary can\u0026rsquo;t respond.\nInside that fallback, however, the filtering logic was wired to call the filtering component\u0026rsquo;s API directly to avoid duplicating logic.\nThat one wire changed everything.\nBoth the primary and the fallback were tied to the same filtering component. The filtering component was a single point of failure, and when that one point went down, both went down with it.\nA fallback that shares dependencies cannot absorb the primary\u0026rsquo;s load. It pushes additional traffic through the same struggling dependency and amplifies the cascade. The fallback server we\u0026rsquo;d stood up separately turned out to be effectively a second primary, sharing the same single point of failure.\nThere\u0026rsquo;s another layer. Why did that single point fall over so quickly? The filtering component runs CPU-bound work — ad candidate evaluation, filtering — and Node\u0026rsquo;s single-threaded event loop doesn\u0026rsquo;t fit that profile well. CPU limits arrived faster on each POD, and that\u0026rsquo;s part of why the cascade\u0026rsquo;s first link started as quickly as it did.\nThe real problem of this outage was not the external event, and not the autoscaling lag. The filtering component was a single point of failure, the fallback was tied to it, and the point itself was structured to hit its CPU ceiling fast.\nRecovery With the diagnosis clear, the fix split into three paths. One separates the two sides by removing the fallback\u0026rsquo;s shared dependency. Another hardens the single point itself so it doesn\u0026rsquo;t collapse all at once. The third lifts the throughput of the point so the limit arrives later. None substitutes for the others. All three are needed to keep the single point of failure from becoming a cascade.\nRemoving the Fallback\u0026rsquo;s Filtering Dependency The first path is to give the fallback its own filtering logic.\nOperational cost goes up. 
Duplicating the logic and its data into the fallback adds synchronization work. The original choice to call the filtering component\u0026rsquo;s API came from that cost trade-off, and the decision wasn\u0026rsquo;t unreasonable at the time.\nThis outage shifted the weight on that trade-off. The savings ate the fallback\u0026rsquo;s definition. We had traded the reason for the fallback\u0026rsquo;s existence against its operating cost. Looking again, one side is clearly heavier.\nOne option worth exploring is hosting the fallback\u0026rsquo;s filtering on serverless infrastructure. Idle is the fallback\u0026rsquo;s normal state, so serverless\u0026rsquo;s zero idle cost matches its profile. Independence comes back without the full operational tax.\nRate Limiting on the Filtering Component The second path hardens the filtering component itself.\nThe first link of the cascade was the pattern of existing healthy PODs collapsing before scale-out finishes. Autoscaling is reactive by design — it kicks in after load arrives, so there\u0026rsquo;s always a gap when a spike hits. During that gap, the healthy PODs need protection from being dragged to their limit.\nRate limiting fills that role. It sheds a portion of requests until new PODs are Ready, keeping the healthy ones from being pushed past their threshold. A circuit breaker can do something similar by cutting requests to a struggling dependency once certain conditions hold. Either way, the goal is to keep the single point of failure from collapsing all at once.\nIf fallback independence separates the two sides, rate limiting hardens the point itself. The single point of failure gets addressed from both directions.\nRuntime Reassessment The third path lifts the throughput of the point itself.\nReviewing the filtering component\u0026rsquo;s workload makes clear that Node was a mismatch for it. 
CPU-heavy evaluation and filtering dominate the work, and a single-threaded event loop turns each in-flight request into a delay for the next.\nFor more throughput on the same POD budget, a systems-language runtime fits better. Runtime reassessment was in progress. If rate limiting keeps the single point from collapsing all at once, reconsidering the runtime raises the ceiling of that point. The two reinforce each other from different angles.\nThe migration stayed at the review stage, but given that the cascade\u0026rsquo;s first link came from hitting the CPU limit, the direction still reads as a valid option.\nRemaining Follow-ups A few operational changes round out the picture. Switching HPA\u0026rsquo;s target from CPU utilization to request count catches load increases earlier. CPU is a signal of the result of load; request count is closer to the cause. The change moves detection upstream.\nManual scale-out by editing k8s configuration directly is slow. Routing it through a Slack bot trims that time significantly.\nNeither is as fundamental as the three paths above, but both shorten the subsequent links of the cascade.\nWhat I Took Away The external event that day was only a trigger. Autoscaling\u0026rsquo;s limits were real. Above all of it sat the structure we had built. The filtering component was a single point of failure, the fallback was tied to it, and the point itself was structured to hit its CPU ceiling fast.\nFollowing the cascade as a timeline pulls the answer toward \u0026ldquo;faster HPA, faster alerts.\u0026rdquo; The retrospective\u0026rsquo;s value sat in the question one step beyond that: why didn\u0026rsquo;t the fallback stop the cascade. From there came the diagnosis that the filtering component had been the single point of failure for both sides all along. Pushing once more brought the question of why that point collapsed so quickly. 
The gap between surface diagnosis and structural diagnosis was where the lesson lived.\nA single point of failure becomes a cascade when the fallback shares the same dependency. Separating the two sides, hardening the point itself, and raising its throughput is how the system gets stronger.\nReferences Ad Fallback Server Design — The original design retrospective of the fallback system. This outage came from a single line in that design: the fallback\u0026rsquo;s dependency on the filtering component. Rate Limiting — Protection layer and algorithm patterns that could have blocked the cascade at its starting point. Circuit Breaker — Circuit Breaker + Bulkhead pattern for blocking cascades on shared dependencies. ","permalink":"https://wid-blog.github.io/en/posts/career/dable/cascading-failure-retrospective/","summary":"An external event drove ad traffic far above normal, triggering a cascading failure. The real problem was that the filtering component was a single point of failure — and the fallback sat on top of it too, so one collapse pulled both down at once. The fix took three paths: removing the fallback\u0026rsquo;s dependency (independence), adding rate limiting to the component itself (protection), and reconsidering the runtime (throughput).","title":"Ad System Outage Retrospective — A Shared Dependency and a Single Point of Failure"},{"content":"sklearn and ONNX aren\u0026rsquo;t answers to the same question. The moment you line them up with \u0026ldquo;what should I use to train my LR?\u0026rdquo; the comparison turns into an illusion. One is a framework for training models. The other is a format for shipping already-trained models. They don\u0026rsquo;t operate at the same layer.\nThis post isn\u0026rsquo;t about choosing between sklearn and ONNX. 
It\u0026rsquo;s about why \u0026ldquo;sklearn or ONNX?\u0026rdquo; isn\u0026rsquo;t a well-formed question to begin with.\nPrerequisites Search \u0026ldquo;sklearn vs ONNX\u0026rdquo; and the two tools come back stacked side by side as if they were competing for the same role. Pros and cons, benchmarks, usage examples: all arranged as parallel choices. That arrangement is what creates the illusion.\nsklearn is a library that takes data and trains models. LogisticRegression, RandomForest, GradientBoosting: training algorithms and their implementations. When training finishes, you save the resulting model as a .pkl file and reload it in a Python process to run predictions. Training through serving, the entire workflow stays within the Python ecosystem.\nONNX has no training algorithms. What ONNX provides is a framework-neutral way of representing a model that has already been trained. A transformer trained in PyTorch and a logistic regression trained in sklearn can both be converted into the same ONNX graph format. From there, any compatible runtime can execute that graph.\nPut plainly, one is a trainer and the other is transport. Asking which to pick between \u0026ldquo;a trainer and a transport\u0026rdquo; is a malformed question. Either they move together, or the transport isn\u0026rsquo;t needed at all.\nsklearn sklearn does two things at once. It trains models, and it stores those models as Python objects you can reload later.\nfrom sklearn.linear_model import LogisticRegression\nimport joblib\n\nmodel = LogisticRegression()\nmodel.fit(X_train, y_train)\njoblib.dump(model, \u0026#34;model.pkl\u0026#34;)\nThat .pkl file follows Python\u0026rsquo;s native serialization format. Nothing outside Python can read it. You need the same sklearn version and the same NumPy version installed to reload it safely. In return, training, storage, and serving connect in a single pipeline with no seams.\nMost ML code trains in Python and serves from a Python process. 
If nothing in that path demands another layer, sklearn\u0026rsquo;s native storage format is the shortest route.\nONNX ONNX is a framework-neutral intermediate representation (IR). It records a model\u0026rsquo;s compute graph in a standardized opset, and a separate runtime such as ONNX Runtime reads that graph and executes it.\nInserting this one extra step unlocks a few things.\nLanguage boundary — a model trained in PyTorch or sklearn can run for inference in C++, C#, Java, or Rust. No Python needed. Hardware boundary — ONNX Runtime provides graph optimizations and hardware-specific execution providers. The same model runs on CPU, CUDA GPU, TensorRT, CoreML, and more. Framework boundary — when the team has PyTorch models and TensorFlow models mixed together and wants a single serving stack, ONNX becomes the common denominator. If those boundaries actually exist in your project, the ONNX layer justifies its cost. If they don\u0026rsquo;t, the layer is nothing more than an extra step in the pipeline.\nONNX Runtime Performance \u0026ldquo;ONNX Runtime is faster\u0026rdquo; is a claim you hear often. It\u0026rsquo;s half-true.\nONNX Runtime can apply graph optimizations (operator fusion, constant folding) and plug into hardware accelerators (CUDA, TensorRT, OpenVINO). In those cases, it can run a given model faster than the native framework. The important word is can.\nFor those gains to actually show up, at least one of the following usually has to be present.\nA GPU or dedicated accelerator A non-Python runtime that sidesteps the GIL A graph large enough that optimization yields meaningful gains Logistic regression meets none of these conditions. It\u0026rsquo;s a single dot product between the weight vector and the input vector. Graph fusion has almost nothing to fuse. 
On a CPU, expecting a meaningful latency difference between ONNX Runtime and sklearn for LR inference isn\u0026rsquo;t realistic.\nSo \u0026ldquo;ONNX is faster\u0026rdquo; is a sentence that isn\u0026rsquo;t actually true until you also specify which model and which environment.\nONNX Adoption Criteria Rather than abstract decision rules, it\u0026rsquo;s more useful to list the concrete conditions under which adding ONNX clearly pays off.\nTraining language and serving language differ. Training runs in Python; inference has to run inside a C++/Java/Go service. ONNX bridges the gap. GPU or edge inference is required. The model is large, latency requirements are tight, or it has to live on an edge device. ONNX Runtime\u0026rsquo;s execution providers support those targets. Multiple frameworks need to converge on one serving stack. PyTorch, sklearn, and TensorFlow models all have to run on the same inference server. ONNX becomes the common format. Training code and serving infrastructure have different lifecycles. You want the training code refactored and version-bumped frequently, but the serving binary has to stay stable. ONNX gives you a fixed point in between. If none of those match your situation, what you actually get from adding ONNX is an extra conversion step, opset version compatibility to worry about, and float/double precision edge cases to debug. Cost without payoff.\nLightweight LR Scenario Consider a lightweight LR model running on a Python training plus Python serving path. GPU inference isn\u0026rsquo;t needed. The model is the size of a single weight vector. There\u0026rsquo;s no plan to run models from other frameworks alongside it. None of the four conditions above applies.\nIn that setup, the real decision isn\u0026rsquo;t \u0026ldquo;should we use ONNX?\u0026rdquo; — it\u0026rsquo;s \u0026ldquo;does an ONNX layer belong in this architecture?\u0026rdquo; It doesn\u0026rsquo;t. 
Persisting the sklearn model with pickle or joblib (a .pkl file) is the shortest path from training to serving.\nSummary Back to the starting question. \u0026ldquo;sklearn or ONNX?\u0026rdquo; isn\u0026rsquo;t in a form that can be answered. The two tools don\u0026rsquo;t operate at the same layer.\nThat question has to be split in two. One half is \u0026ldquo;which library should I train with?\u0026rdquo; — a choice between sklearn, PyTorch, XGBoost, and other training frameworks. The other half is \u0026ldquo;what format should the trained model ship in?\u0026rdquo; — which can be each framework\u0026rsquo;s native storage format, or ONNX.\nOnce you split it, \u0026ldquo;do I need an ONNX layer?\u0026rdquo; becomes independent of the training framework question. And for most lightweight models, that question closes fast with a \u0026ldquo;no\u0026rdquo;. There\u0026rsquo;s no reason to add a layer where none is needed.\nTwo tools that aren\u0026rsquo;t answers to the same question give awkward answers whenever you force them into the same question. Rewrite the question first.\n","permalink":"https://wid-blog.github.io/en/posts/tech/ml/model-training-frameworks/","summary":"sklearn and ONNX aren\u0026rsquo;t competing at the same layer. Once you separate their roles, the real question becomes \u0026lsquo;do I need an ONNX layer at all?\u0026rsquo;","title":"Choosing a Model Training Framework: sklearn vs ONNX"},{"content":"When picking a baseline for CTR prediction, there are many candidates: Gradient Boosting, Neural Networks, and Logistic Regression. Among them, LR is still often selected as the baseline. There are reasons for that.\nLR Properties Lightweight. The model is a single dot product. Training and inference both scale linearly with the number of features.\nInterpretable. Every coefficient directly indicates \u0026ldquo;how much this feature contributes to the outcome.\u0026rdquo;\nProbability output. It outputs values between 0 and 1. 
In ads, you multiply those directly against a bid.\nModel Structure The most direct way to understand Logistic Regression is to start from linear regression.\nLinear regression outputs a weighted sum of the inputs.\n$$ z = w \\cdot x + b $$The problem is that $z$ ranges over all real numbers. To produce a probability like CTR, the output must lie between 0 and 1. Linear regression does not guarantee that.\nThe sigmoid function solves this.\n$$ \\sigma(z) = \\frac{1}{1 + e^{-z}} $$Sigmoid smoothly compresses the entire real line into $(0, 1)$. No matter how large the input, it approaches 1; no matter how small, it approaches 0. Pass the output of linear regression through sigmoid, and you get a probability.\nThis simple composition is all there is to Logistic Regression: a linear model with a probability layer on top.\nOne thing worth noting. The probability output is nonlinear, but the decision boundary, the surface that separates the two sides at probability 0.5, remains linear. The hyperplane $w \\cdot x + b = 0$ is itself the boundary. LR is \u0026ldquo;a linear classifier with probabilities bolted on.\u0026rdquo;\nlog-loss Once the model structure is fixed, training becomes \u0026ldquo;finding good $w$ and $b$.\u0026rdquo; We need a criterion for \u0026ldquo;good.\u0026rdquo;\nLinear regression uses MSE. LR does not. The reason lies in the output shape.\nLR\u0026rsquo;s output is a probability. There is a more suitable choice of loss for probabilistic models: log-loss (a.k.a. cross-entropy).\n$$ L = -\\frac{1}{N} \\sum_{i=1}^{N} \\left[ y_i \\log \\hat{y}_i + (1 - y_i) \\log (1 - \\hat{y}_i) \\right] $$When the label is 1, the loss shrinks as $\\log \\hat{y}$ grows; when the label is 0, the loss shrinks as $\\log(1 - \\hat{y})$ grows. The closer the predicted probability gets to the truth, the closer the loss gets to zero.\nLog-loss is convex for LR. No local minima. The optimization converges to the global optimum. 
This property is the mathematical reason LR trains quickly on large-scale data.\nWhy These Properties The three characteristics from the overview, lightweight, interpretable, probability output, all follow from the structure above.\nLightweight A trained LR model is ultimately a weight vector $w$ and a bias $b$. Inference is one dot product and one sigmoid. Whether you have a million features or ten million, the computation scales linearly with the feature count. Compared to the many multiplications and nonlinearities in tree ensembles or neural networks, LR requires far less computation.\nInterpretable A coefficient $w_i$ means \u0026ldquo;when feature $i$ increases by one unit, the log-odds shift by $w_i$.\u0026rdquo; The sign indicates direction; the magnitude indicates influence. When you want to know \u0026ldquo;which feature contributes positively to clicks\u0026rdquo; in the ad domain, LR answers with a single table of coefficients. This satisfies the accountability requirements on the operations side.\nProbability Output Many classifiers output only a ranking score. LR outputs a calibrated probability. Ad expected-value math requires multiplying that number directly: predicted CTR × bid = expected revenue. A score that is not a probability cannot be used directly in the bidding formula.\nCTR Prediction Looking at CTR prediction as a problem reveals why LR fits.\nSparse. Most features are one-hot-encoded categoricals. Out of millions of dimensions, only a handful are 1; the rest are 0.\nHigh-dimensional. The cross-product of ad, user, and context spreads across millions to hundreds of millions.\nLarge-scale. Training data accumulates in large daily volumes.\nLR aligns with all three. The dot product of a sparse vector only needs to touch the non-zero entries, so the computation scales with the actual count of populated features, not the raw dimensionality. Training is easy to distribute via the SGD family. 
Inference fits inside the tight latency budget of real-time bidding.\nWhen bringing up a CTR model for the first time, these characteristics become decisive. You need to establish a baseline quickly, covering the training pipeline, serving, and monitoring, and validate the entire lifecycle first. A more complex model delays that validation itself.\nLimits and What Comes Next Having seen why LR serves as the baseline, we should also see why it is eventually replaced.\nThe biggest limit is the absence of nonlinear interactions. LR cannot discover products of features, conditional effects, or other complex combinations on its own. A human has to define them in advance through feature engineering. As feature combinations grow, the engineering cost increases and operations become constrained by feature-design reviews.\nSo when do you move on? When the data and the operational load reach a point that feature engineering can no longer absorb. Gradient Boosting Decision Trees learn interactions on their own. Neural networks go further, converting high-cardinality categoricals into continuous vectors through embeddings. Both directions address exactly LR\u0026rsquo;s limits.\nThat said, LR remains a reasonable starting point. If you start with a complex model and no baseline, you cannot distinguish the model\u0026rsquo;s contribution from the pipeline\u0026rsquo;s. The numbers LR provides become the reference line for every comparison that follows.\nClosing There were reasons for choosing the old model.\nThose reasons are in the structure. The composition of a linear model and sigmoid, the convexity of log-loss, the efficiency in sparse, high-dimensional spaces. 
Together, these three keep LR as the baseline for CTR prediction.\nEven when the time comes to move to the next model, the numbers LR provided remain as the baseline.\n","permalink":"https://wid-blog.github.io/en/posts/tech/ml/logistic-regression/","summary":"The structure and characteristics of Logistic Regression, and why an old model still serves as the baseline in CTR prediction.","title":"Revisiting Logistic Regression"},{"content":"The deploy was two days old and metrics had been calm the whole time. That evening we removed the old cache refresh batch and cleared the lingering cache; external ad responses stopped. That was when we realized the change from two days earlier had never actually taken effect.\nTimeline On 11-25 we shipped what was supposed to be a switch to the new cache module. Monitoring stayed normal, user traffic looked fine. But because of a package version-lock mistake, the code referencing the new module was there while the dependency was still pinned to the old one. The system was reading the old cache keys, and a separate batch kept refreshing the old cache, so nothing looked broken.\nAt 18:20 on 11-27 we removed the old cache refresh batch. The cache stayed valid while the TTL ran out; at 18:39 we manually cleared what was left, and old cache misses started immediately. Nothing was writing to the new cache keys, so the system could not fetch campaign data. External ad responses stopped at 18:40, an external SSP team reported the outage at 19:23, and recovery completed at 20:05.\nRoot Cause A deploy being done and the change being applied are two different facts. Until that evening, monitoring had only been reporting \u0026ldquo;deploy finished, system normal.\u0026rdquo; \u0026ldquo;Did we actually switch to the new module?\u0026rdquo; was not captured by any metric.\nThe signal in this case was obvious in hindsight. 
The lookup rate on the new cache keys should have been non-zero, and the lookup rate on the old keys should have been trending to zero. Comparing the two right after the 11-25 deploy would have surfaced the broken state immediately. But that metric did not exist. \u0026ldquo;Operating correctly on the old cache\u0026rdquo; and \u0026ldquo;operating correctly on the new cache\u0026rdquo; were indistinguishable at the metric level.\nTwo secondary factors delayed detection. When campaign data was empty, the system still returned HTTP 200; the internal placements\u0026rsquo; Fallback mechanism absorbed the alarm signal. Together they kept the dashboard looking calm right after the cache misses began. They were not the root cause though. The root cause had been sitting quietly since two days earlier.\nRetrospective I think post-deploy monitoring should lead with \u0026ldquo;did the change actually take effect?\u0026rdquo; before \u0026ldquo;is the main metric steady?\u0026rdquo; Dependency version hashes, new cache key lookup rates, old cache key lookup rates — whatever form it takes, before-and-after needs to be separable as a metric. Without it, the broken state goes quiet and blows up on its own schedule.\nLooking back, two days of calm was the most dangerous signal in this incident. A flat metric line looks the same whether the change took effect or never did.\n","permalink":"https://wid-blog.github.io/en/posts/career/dable/campaign-cache-miss-retrospective/","summary":"The deploy was two days old, and the metrics had been calm the whole time. The moment we turned off the cache refresh batch, ad serving stopped. A retrospective on the missing verification of what a deploy actually changed.","title":"The Blind Spot in Deploy Change Verification — Campaign Cache Incident Retrospective"},{"content":"Kubernetes (k8s) is a container orchestration platform. It automates deploying, scaling, and recovering large numbers of containers across a cluster. 
Beyond a single server where Docker handles things directly, once container counts grow to dozens or hundreds, questions like who restarts a terminated container, how to scale on traffic spikes, and how containers communicate need a declarative model.\nCluster Architecture A k8s cluster consists of the Control Plane and Worker Nodes.\nflowchart TB subgraph cp[\"Control Plane\"] API[\"API Server\"] ETCD[\"etcd\"] SCHED[\"Scheduler\"] CM[\"Controller Manager\"] end subgraph wn1[\"Worker Node\"] KL1[\"kubelet\"] KP1[\"kube-proxy\"] CR1[\"Container Runtime\"] P1[\"Pod\"] P2[\"Pod\"] end subgraph wn2[\"Worker Node\"] KL2[\"kubelet\"] KP2[\"kube-proxy\"] CR2[\"Container Runtime\"] P3[\"Pod\"] end API --\u003e SCHED API --\u003e CM API --\u003e ETCD API --\u003e KL1 API --\u003e KL2 Control Plane The set of components that manage the entire cluster.\nAPI Server is the entry point for all requests. Whether from kubectl or internal components, every k8s operation goes through it. etcd is a distributed key-value store holding the cluster state — which Pods run where, which Deployments exist, and so on.\nScheduler decides which Node should run a newly created Pod, considering resource availability and affinity rules. Controller Manager watches whether the current state matches the declared state. If a Deployment says replicas: 3 but only 2 Pods exist, the Controller creates one more.\nWorker Node The servers where containers actually run.\nkubelet manages Pod lifecycles on each Node. It receives instructions from the API Server and starts containers through the Container Runtime. kube-proxy manages Node-level networking rules, routing traffic that arrives at a Service to the appropriate Pod.\nEnd-to-End Flow Running kubectl apply -f deployment.yaml sends the request to the API Server. It stores the desired state in etcd. The Scheduler picks a Node for each Pod. The Node\u0026rsquo;s kubelet creates the containers. 
The Controller Manager continues monitoring and correcting any drift between declared and actual state.\nWhat backend developers interact with directly is kubectl and YAML manifests. The rest is handled internally by k8s.\nCore Objects Pod A Pod is the smallest deployable unit in k8s. Containers are wrapped in Pods because containers in the same Pod share a network namespace and storage. This enables sidecar patterns like placing a log collector or proxy alongside the main container.\nIn most cases, one Pod contains one container. Pods are rarely created directly; Deployments manage them.\nDeployment A Deployment declaratively manages Pods. Declare \u0026ldquo;maintain 3 Pods with this image\u0026rdquo; and k8s automatically creates the 3 Pods and restarts any that terminate. It is the most frequently used object when deploying backend services.\nThe default deployment strategy is a rolling update. New Pods spin up one by one while old Pods shut down one by one. The service stays available throughout. If something goes wrong, kubectl rollout undo reverts to the previous version.\napiVersion: apps/v1 kind: Deployment metadata: name: api-server spec: replicas: 3 selector: matchLabels: app: api-server template: metadata: labels: app: api-server spec: containers: - name: api image: api-server:1.2.0 ports: - containerPort: 8080 Service Pods are created and destroyed frequently, and their IPs change. Using Pod IPs directly is unreliable. A Service provides a stable access point to a set of Pods.\nClusterIP assigns a virtual IP accessible only within the cluster. It is the primary choice for inter-service communication. NodePort opens a specific port on each Node for external access. LoadBalancer automatically provisions a cloud load balancer.\nClusterIP is the most common choice in backend development. Call another service at http://service-name:port and k8s DNS resolves it to the Service\u0026rsquo;s ClusterIP. 
No need to implement service discovery separately.\nNamespace A Namespace logically partitions a cluster. It isolates environments like dev, staging, and production within the same cluster. Resource names only need to be unique within a Namespace.\nNetwork Ingress If a Service provides access within the cluster, Ingress routes external traffic to internal Services. It distributes traffic based on domain names or URL paths. Since it enables path-based routing without a separate API Gateway, it is frequently used in backend service architectures.\napiVersion: networking.k8s.io/v1 kind: Ingress metadata: name: api-ingress spec: rules: - host: api.example.com http: paths: - path: /users pathType: Prefix backend: service: name: user-service port: number: 80 - path: /orders pathType: Prefix backend: service: name: order-service port: number: 80 api.example.com/users routes to user-service, /orders to order-service. Ingress defines the rules; an Ingress Controller (nginx, traefik, etc.) handles the actual traffic.\nConfiguration and Storage ConfigMap and Secret Embedding configuration in a container image forces a rebuild for every change. A ConfigMap separates configuration data into its own object. It injects values as environment variables or mounts them as volumes.\nA Secret has the same structure as a ConfigMap but stores sensitive information like passwords and API keys. Values are base64-encoded (not encrypted), but combined with RBAC, access control is possible.\nIn backend development, ConfigMap and Secret handle the separation of DB connection strings, external API keys, and similar configuration from code.\nPersistentVolume Pod deletion erases internal data. Workloads like databases need persistent storage. A PersistentVolume (PV) is storage pre-provisioned by a cluster administrator. A PersistentVolumeClaim (PVC) is how a Pod requests a PV. 
The Pod only knows about the PVC — it does not need to know where the actual storage resides.\nHealth Checks For k8s to automatically judge Pod health, the application must expose its status. Backend developers implement this directly.\nreadinessProbe checks if a Pod is ready to receive traffic. Unready Pods are excluded from Service routing. Useful when the server needs cache warm-up before accepting requests.\nlivenessProbe checks if a Pod is functioning normally. Failure triggers a restart. It detects deadlocks or unresponsive states.\nstartupProbe is for slow-starting applications. It defers liveness/readiness checks until startup completes.\ncontainers: - name: api image: api-server:1.2.0 readinessProbe: httpGet: path: /health/ready port: 8080 periodSeconds: 5 livenessProbe: httpGet: path: /health/live port: 8080 periodSeconds: 10 Implement /health/ready and /health/live endpoints in the backend server. Readiness typically checks DB connections and external dependencies. Liveness checks only whether the server process itself is alive.\nScaling Manual Scaling kubectl scale deployment api-server --replicas=5 This directly changes the Deployment\u0026rsquo;s replica count. Suitable when traffic patterns are predictable or when pre-scaling for a known event.\nHPA Unpredictable traffic makes manual scaling impractical. HPA (Horizontal Pod Autoscaler) adjusts Pod count based on metrics.\napiVersion: autoscaling/v2 kind: HorizontalPodAutoscaler metadata: name: api-server-hpa spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: api-server minReplicas: 2 maxReplicas: 10 metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 70 This says \u0026ldquo;scale up when average CPU exceeds 70%, scale down when it drops. Minimum 2 Pods, maximum 10.\u0026rdquo;\nmetrics-server periodically collects resource usage from each Pod. 
The HPA Controller compares current average utilization against the target and calculates the needed Pod count. If 3 Pods average 90% CPU with a 70% target, the desired count is ceil(3 × 90/70) = ceil(3.86) = 4 Pods.\nHPA requires resource requests on the Deployment. Without requests, there is no denominator for \u0026ldquo;70% of what.\u0026rdquo;\ncontainers: - name: api resources: requests: cpu: 200m memory: 256Mi limits: cpu: 500m memory: 512Mi Requests guarantee minimum resources for a Pod. The Scheduler uses this value to judge Node capacity. Limits cap maximum resources. Exceeding CPU limits causes throttling; exceeding memory limits triggers an OOMKill. Monitor the service\u0026rsquo;s actual memory usage and set requests and limits to match it.\nBeyond CPU, HPA supports memory and custom metrics (request count, queue length, etc.). VPA (Vertical Pod Autoscaler) adjusts individual Pod resources instead of Pod count, but using it on the same metrics as HPA simultaneously can cause conflicts.\nOperations kubectl Basics kubectl get pods # List Pods kubectl get pods -o wide # Include Node placement kubectl describe pod \u0026lt;name\u0026gt; # Pod details + events kubectl logs \u0026lt;pod-name\u0026gt; # View logs kubectl logs \u0026lt;pod-name\u0026gt; -f # Follow logs kubectl exec -it \u0026lt;pod-name\u0026gt; -- sh # Shell into container Debugging Flow kubectl get pods shows Pod status. States like CrashLoopBackOff or ImagePullBackOff indicate the cause. kubectl describe pod \u0026lt;name\u0026gt; reveals events — whether the Scheduler failed to find a Node, resources were insufficient, or the image pull failed. kubectl logs shows application logs. If logs are not enough, kubectl exec provides direct access inside the container.\nA Pod failing to start after deployment is the most common k8s issue backend developers encounter. Learning this flow covers most debugging scenarios.\nWrap-up k8s operates on a declarative model: declare the desired state and the system maintains it. 
Set replicas: 3 on a Deployment and k8s keeps 3 Pods running. Set a target CPU on HPA and k8s adjusts Pod count automatically. For backend developers, the key responsibilities are implementing health check endpoints, configuring resource requests, and knowing kubectl for debugging.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/kubernetes-fundamentals/","summary":"Container orchestration basics and what backend developers need to know: core objects, networking, scaling with HPA, and operational essentials.","title":"Kubernetes Fundamentals"},{"content":"Go\u0026rsquo;s concurrency model builds on CSP (Communicating Sequential Processes). The core philosophy is one line:\n\u0026ldquo;Do not communicate by sharing memory; instead, share memory by communicating.\u0026rdquo;\nInstead of locking shared memory, pass data through channels. Goroutines handle execution, Channels handle communication, and the sync/atomic packages provide auxiliary synchronization.\nGoroutine A goroutine is Go\u0026rsquo;s lightweight execution unit. It is not an OS thread. The Go runtime multiplexes many goroutines onto a small number of OS threads.\ngo func() { // This function runs in a new goroutine }() A single go keyword creates one. The initial stack is only a few kilobytes, and the runtime grows and shrinks it automatically as needed. OS threads become impractical at a few thousand; goroutines scale to hundreds of thousands in the same address space.\nGMP Scheduler The Go runtime uses an M:N scheduling model, known as the GMP model.\nflowchart TB subgraph Runtime[\"Go Runtime\"] subgraph P1[\"P (Processor)\"] LRQ1[\"Local Queue: G1, G2, G3\"] end subgraph P2[\"P (Processor)\"] LRQ2[\"Local Queue: G4, G5\"] end GRQ[\"Global Queue: G6, G7...\"] end subgraph OS[\"OS\"] M1[\"M (OS Thread)\"] M2[\"M (OS Thread)\"] M3[\"M (OS Thread)\"] end P1 --\u003e M1 P2 --\u003e M2 GRQ -.-\u003e|\"Stolen when P's local queue is empty\"| P1 G — Goroutine. 
A lightweight execution unit carrying a function and its stack.\nM — Machine. An OS thread. Executes instructions on the actual CPU.\nP — Processor. A logical processor. Provides the context needed to run goroutines. GOMAXPROCS controls the number of Ps, defaulting to the CPU core count.\nEach P has a local queue. When a goroutine is created, it enters the current P\u0026rsquo;s local queue. An M attaches to a P and executes goroutines from its local queue one by one. If a goroutine blocks on a system call, the runtime moves the other goroutines on that P to a different M to keep them running.\nThe overhead is small enough that a single address space can hold hundreds of thousands of goroutines.\nChannel A Channel is a typed communication mechanism for passing data between goroutines.\nUnbuffered Channel ch := make(chan int) Both sender and receiver must be ready for the transfer to complete. The sender blocks until the receiver takes the value; the receiver blocks until the sender sends. Communication and synchronization happen simultaneously.\nsequenceDiagram participant G1 as Goroutine 1 participant Ch as Channel (unbuffered) participant G2 as Goroutine 2 G1-\u003e\u003eCh: Send (blocks) Note over G1,Ch: Waits until G2 receives G2-\u003e\u003eCh: Receive Ch--\u003e\u003eG1: Send completes Ch--\u003e\u003eG2: Value delivered Buffered Channel ch := make(chan int, 10) // buffer size 10 Sends complete immediately if buffer space is available. The sender blocks only when the buffer is full. Buffered channels can serve as semaphores to limit concurrency.\nDirectionality Specifying channel direction clarifies a function\u0026rsquo;s intent.\nfunc producer(out chan\u0026lt;- int) { // send-only out \u0026lt;- 42 } func consumer(in \u0026lt;-chan int) { // receive-only val := \u0026lt;-in } select The select statement executes whichever channel operation is ready. 
It handles waiting on multiple channels, timeouts, and non-blocking operations.\nselect { case msg := \u0026lt;-ch1: handle(msg) case ch2 \u0026lt;- response: // send completed case \u0026lt;-quit: return default: // no channel ready } Including default makes the select non-blocking when no channel is ready.\nKey Patterns flowchart LR subgraph FanOut[\"Fan-Out\"] IN1[\"Input\"] --\u003e W1[\"Worker 1\"] IN1 --\u003e W2[\"Worker 2\"] IN1 --\u003e W3[\"Worker 3\"] end subgraph FanIn[\"Fan-In\"] R1[\"Result 1\"] --\u003e OUT1[\"Output\"] R2[\"Result 2\"] --\u003e OUT1 R3[\"Result 3\"] --\u003e OUT1 end Fan-Out. Multiple goroutines read from a single channel to distribute work.\nFan-In. Results from multiple channels merge into one.\nPipeline. Processing stages connected by channels. Each stage reads from an input channel, processes, and sends to an output channel.\nsync Package Channels are not always the best choice. For simple shared state protection, the sync package fits well.\nMutex. Ensures only one goroutine enters a critical section. Controlled with Lock() and Unlock().\nRWMutex. Multiple goroutines read concurrently; writes are exclusive. Effective when reads far outnumber writes.\nWaitGroup. Waits for multiple goroutines to finish. Add() increments the counter, Done() decrements it, Wait() blocks until zero.\nOnce. Runs a function exactly once. Used for initialization.\natomic Package The sync/atomic package provides atomic operations on integers and pointers. It reads and writes single variables safely without locks.\nCompareAndSwap (CAS) is the foundation of lock-free algorithms. If the current value equals the expected value, it swaps in the new value and returns true. 
Otherwise, it returns false and does nothing.\nvar counter int64 // Safe increment from multiple goroutines atomic.AddInt64(\u0026amp;counter, 1) // CAS: swap only if expected value matches atomic.CompareAndSwapInt64(\u0026amp;counter, oldVal, newVal) These are lower-level tools than the sync package. Suitable for simple counters and flags, but Mutex or Channel is better for complex synchronization.\nSelection Criteria Scenario Tool Passing data between goroutines Channel Work distribution, result collection Channel (fan-out/fan-in) Protecting shared state (read/write) sync.RWMutex Limiting concurrency Buffered Channel Waiting for multiple goroutines sync.WaitGroup Simple counters/flags sync/atomic The Go wiki summarizes it the same way. Channels suit ownership transfer, work distribution, and async result delivery. Mutexes suit caches and shared resource access control. Both are valid tools — choose based on the situation.\n","permalink":"https://wid-blog.github.io/en/posts/tech/language/go-concurrency-model/","summary":"Go\u0026rsquo;s concurrency model builds on CSP, providing Goroutines and Channels as core tools. An overview of how each works and when to choose what.","title":"Go Concurrency Model"},{"content":"I understood Go\u0026rsquo;s concurrency model conceptually. Goroutines are lightweight threads, channels handle communication, the sync package provides synchronization. But I had never compared the patterns side by side with code and benchmarks.\nI decided to implement and benchmark them myself. One project, three approaches: mutex, channel, and lock-free.\nMutex The first implementation was a concurrency-safe map using sync.RWMutex. Writes use Lock(), reads use RLock(), allowing multiple goroutines to access the map concurrently.\nAfter implementing it, I benchmarked against Go\u0026rsquo;s standard sync.Map. 
I created three scenarios: contended writes on the same key, disjoint writes across different keys per goroutine, and a read-heavy workload at 90% reads.\nOn contended same-key writes, both performed similarly. But on disjoint key writes, sync.Map was 2-3x faster, and on read-heavy workloads, 33% faster. The results matched exactly what the sync.Map documentation states as its optimization conditions. Conversely, on concentrated same-key writes, sync.Map only used more memory with no performance advantage.\nChannel For the channel pattern, I implemented data flow control. FanOut distributes data from one input channel to multiple output channels. It uses a select statement to send to whichever output channel is ready first.\nTurnOut routes from multiple inputs to multiple outputs while handling shutdown signals through a quit channel. Including the quit channel in the select statement lets the loop handle both data processing and graceful shutdown naturally. I also implemented the cleanup process of draining remaining data after closing channels.\nGenerics ([T any]) made the implementations reusable across types.\nLock-free I implemented two lock-free patterns.\nSpinningCAS implements a lock using atomic.CompareAndSwapInt32. When another goroutine holds the lock, instead of entering a wait queue, it spins by repeating the CAS operation. runtime.Gosched() proved critical here. Without yielding the CPU during the spin loop, other goroutines couldn\u0026rsquo;t execute, creating a near-deadlock situation. One line of code changed the entire behavior.\nI benchmarked SpinningCAS against the standard sync.Mutex. On a high-contention scenario incrementing a single shared variable, SpinningCAS was about 7x faster. Mutex carries the overhead of parking and unparking goroutines in a wait queue, while CAS retries immediately. The numbers confirmed that spinning wins on short critical sections.\nTicketStorage addresses cases requiring ordering guarantees. 
atomic.AddUint64 issues ticket numbers, and each goroutine spins with CAS until its ticket comes up. It guarantees fairness (FIFO) but trades off longer wait times under high contention.\nRetrospective Understanding concurrency patterns conceptually and experiencing them through benchmarks were different things.\nThe biggest lesson was benchmark methodology. I initially wrote benchmarks that spawned a fixed number of goroutines, and results varied between runs. Switching to Go\u0026rsquo;s b.RunParallel, which lets the framework auto-calibrate iteration counts, stabilized results and made pattern differences clear. Benchmark code accuracy determines result quality.\nsync.Map is not \u0026ldquo;always a faster map\u0026rdquo; — its advantage appeared only under the conditions stated in the official documentation. SpinningCAS dominated Mutex on short critical sections, but longer sections or lower contention could reverse the result. Each tool has optimal conditions, and verifying those conditions is what benchmarks are for.\nThe experience of runtime.Gosched() changing behavior with a single line also stayed with me. In concurrent code, a theoretically correct implementation can behave differently in practice.\nOnly after writing the code and facing the numbers did each pattern\u0026rsquo;s trade-offs become tangible. This project confirmed that difference.\nReferences concurrency-go GitHub Repository Go Concurrency Model ","permalink":"https://wid-blog.github.io/en/posts/career/personal/concurrency-go-retrospective/","summary":"A record of implementing and benchmarking three Go concurrency patterns — mutex, channel, and lock-free — to build hands-on understanding.","title":"concurrency-go"},{"content":"MongoDB and Redis are the two tools you run into most often when working with NoSQL. Both sit under the NoSQL umbrella, yet the roles they take in production differ. 
MongoDB is a document store that holds durable data as the primary store, while Redis is an in-memory store that typically takes a supporting role — cache, sessions, and the like.\nWhen I built chat-services, I first stored messages in Redis, then moved to MongoDB once durability became a requirement. The retrospective only touched on this briefly. This post walks through why two tools sitting under the same NoSQL umbrella end up in different roles, across data model, storage, schema, scaling, and use cases.\nData Model Where the two tools first differ is the unit of storage.\nRedis is a key-value store. Each key maps to a single value, and value types include data structures like String, List, Set, Sorted Set, Hash, and Stream. Single-key reads and writes are the most natural access pattern; complex queries rely on the client side or external indexes.\nMongoDB is a document store. Instead of plain key-value, it stores JSON-like documents (BSON) as units. A single document can hold nested fields, arrays, and varied types, all grouped into collections. A document often corresponds to a single entity, which maps naturally onto domain models.\nStorage and Durability Redis keeps data in memory. Durability is optional, with two mechanisms available: RDB snapshots and AOF (Append Only File) command logs. In a typical configuration with only RDB enabled, changes after the last snapshot can be lost during a failure. AOF narrows the loss window by logging each write, but memory remains the primary storage medium.\nMongoDB stores data on disk. The storage engine is WiredTiger by default, and writes are persisted to disk along with the journal. Durability is built in from the start. Adding a replica set replicates data across nodes, so data survives a single-node failure.\nSchema and Query Redis queries are commands. Commands like GET, SET, HGETALL, ZRANGE are called directly per data structure. 
Conditional searches are hard to express with the base commands, so modules like RediSearch are added on top for richer queries. There is no schema concept, and how a value is interpreted is up to the application.\nMongoDB provides its own query language, MQL. find, aggregate, and update express conditional searches, aggregation, and partial updates. Collections do not enforce a schema, but JSON Schema validation can constrain field shapes when needed. This fits early stages where domains shift often, and validation can be layered in incrementally once things stabilize.\nScaling and Availability Redis Cluster splits the key space into a fixed number of slots and distributes them across nodes. Each slot is assigned to a node, and one node owns the keys for its slots. Replication uses a master-replica structure. When a failure happens, Sentinel or Cluster promotes a replica to master.\nMongoDB scales in two ways: sharded clusters and replica sets. A sharded cluster splits a collection by shard key across multiple shards. A replica set keeps copies of data on multiple nodes; if the primary fails, one of the secondaries is automatically elected as the new primary. Automatic failover is part of the default behavior.\nUse Cases Here is where their roles diverge.\nRedis often sits in supporting roles: cache, session storage, rate limiting, distributed locks, short-lived queues. It fits cases where fast response and low latency matter, and where data can live in memory and tolerate eviction.\nMongoDB takes the primary store role. Durable domain data lives there, and it tends to be chosen when schemas shift or when document shapes map naturally onto the domain. 
Users, content, orders, and logs are typical examples.\nBoth tools fall under NoSQL, but in practice the question that splits them is whether the data needs to be durably stored or kept as a supporting cache.\nUsed Together In practice, the two are used together more often than alone.\nThe most common setup puts MongoDB as the primary store and Redis in front as a cache layer. Frequently read data sits in the Redis cache to reduce disk access; on a cache miss, MongoDB is read and the result fills the cache again. Supporting data like sessions, rate limits, and one-time tokens lives only in Redis, leaving MongoDB to focus on domain data.\nIn chat-services, messages started in Redis and moved to MongoDB once durability was needed. Adding a Redis cache layer in front of MongoDB later would put the same two tools in the roles that suit them best.\nChoosing Between Them The comparison, in table form:\nAspect Redis MongoDB Data model Key-value Document Primary medium Memory Disk Default durability Optional (RDB/AOF) On by default (journal + replica set) Query Commands MQL (find/aggregate) Schema None Flexible + optional validation Scaling Cluster (slot distribution) Sharded cluster (shard key) Auto failover Sentinel/Cluster Replica set by default Primary role Cache, sessions, rate limit Primary store Three things decide it:\nIf durable storage is required, MongoDB is the primary and Redis sits as a supporting cache. If latency matters most on a hot path, Redis. If schemas shift often, MongoDB\u0026rsquo;s document model fits the domain well. Even under the same NoSQL umbrella, the two tools take on different roles. 
That\u0026rsquo;s why the combination shows up so often in real systems.\nReferences chat-services — Redis → MongoDB transition context What RDB Transaction ACID Actually Guarantees — RDB transaction guarantees MongoDB docs — Storage Engines, Replica Set, Sharded Cluster Redis docs — Persistence, Cluster, Replication ","permalink":"https://wid-blog.github.io/en/posts/tech/database/mongodb-vs-redis/","summary":"Why MongoDB and Redis end up in different roles even under the same NoSQL umbrella. A comparison across data model, storage, schema, scaling, and use cases.","title":"MongoDB vs Redis — Same NoSQL, Different Roles"},{"content":"I used Kafka at work, but I had never configured a cluster from scratch or made decisions from topic design to consumer group strategy. Hexagonal Architecture was similar — I had followed port/adapter patterns in existing code without ever structuring layers from an empty project. I wanted hands-on experience with both, so I decided to build a chat system.\nWhy Chat Chat aligns naturally with Kafka\u0026rsquo;s pub/sub model. Publishing messages and delivering them to subscribers mirrors the core behavior of a chat system.\nReal-time communication over WebSocket, event-driven architecture, message synchronization across multiple instances — I decided a single project could cover all three.\nTechnology Choices Go + Java I built the chat service in Go. Lightweight goroutine-based concurrency suited a WebSocket server well. The user authentication service used Java with Spring WebFlux. The Spring Security ecosystem provided solid OAuth2 + JWT support, and I was already familiar with the framework.\nThe API Gateway used Kotlin with Spring Cloud Gateway. It ran on the same reactive stack as the user-service, maintaining consistency within the Java ecosystem.\nMongoDB Chat messages fit naturally into a document structure. Rooms and messages resembled unstructured data, and I expected frequent schema changes.\nI started with Redis. 
It worked well for quick prototyping, but I switched to MongoDB when message persistence became necessary.\nKafka KRaft I configured Kafka in KRaft mode — Kafka managing its own metadata without depending on ZooKeeper. No need to operate a separate ZooKeeper cluster, which simplified the infrastructure.\nI set up a 3-node cluster using Docker Compose, with each node serving as both controller and broker.\nArchitecture Evolution The project was not designed all at once. It evolved incrementally through pull requests.\nStarting Point I started with two services: user-service (Java) and chat-service (Go). The chat-service handled WebSocket connections, room management, message storage, and broadcasting. Redis served as the data store.\nRedis → MongoDB Messages needed persistent storage. Redis was not suitable due to its in-memory nature, so I replaced it with MongoDB. During this process, I experienced the benefit of only needing to swap the repository layer — a direct advantage of Hexagonal Architecture.\nHexagonal Architecture Cleanup I restructured the user-service first. Packages that had been loosely organized were rearranged into domain/entity, port/driving, port/driven, adapter/driving, and adapter/driven. I then applied the same structure to the chat-service.\nKafka Integration I implemented the Kafka producer first, then added the consumer. This is when I encountered concurrency issues.\nRace conditions occurred when users joined or left chat rooms while messages were being broadcast simultaneously. I introduced a two-level lock strategy in the RoomManager: an RWMutex at the RoomManager level for room list access, and a separate RWMutex per LiveRoom for participant access. This reduced contention.\nService Separation As the chat-service grew, I split it into messenger-service and message-service. The messenger-service handles Kafka producer/consumer and WebSocket connections. 
The message-service handles message storage and retrieval.\nFat Domain Initially, domain entities only held data. I moved domain logic into entities and introduced the use case pattern in the application layer. Each use case has a single Handle method, responsible for one business operation.\nKafka as Chat Message Broker The message flow:\nsequenceDiagram participant C as WebSocket Client participant S as SendUseCase participant DB as MongoDB participant K as Kafka participant B as MessageBroker participant R as RoomManager C-\u003e\u003eS: Send message S-\u003e\u003eDB: Store message S-\u003e\u003eK: Kafka publish K-\u003e\u003eB: Consumer receives B-\u003e\u003eS: OnReceive callback S-\u003e\u003eR: Broadcast R-\u003e\u003eC: WebSocket delivery SendUseCase directly implements the MessageSubscriber interface and registers itself with the MessageBroker — the Observer pattern. When the consumer receives a message, it calls OnReceive on all registered subscribers. Each subscriber uses the RoomManager to deliver the message to every WebSocket client in the corresponding room.\nThe advantage is horizontal scaling. When multiple chat service instances run, a message from one instance reaches other instances through Kafka. Users connected to different instances can still exchange messages within the same room.\nRetrospective I started this project because I wanted hands-on experience with Kafka.\nI confirmed that Hexagonal Architecture works naturally in Go. Go\u0026rsquo;s implicit interfaces made defining ports and implementing adapters straightforward. Assembling dependencies directly in the main function without a DI framework turned out to be explicit and easy to trace.\nConcurrency control taught me the most. I initially protected the entire room list with a single RWMutex, which created a bottleneck. Switching to a two-level strategy — separate locks for room list access and per-room participant access — showed a clear difference in benchmarks. 
Understanding concurrency in theory and experiencing it through benchmarks were different things entirely.\nThere are regrets. Test coverage was insufficient. One key benefit of Hexagonal Architecture is easy testing by swapping ports with mocks, but I did not write enough tests to take full advantage of this.\nI also configured gRPC but never applied it to inter-service communication. All services currently communicate over REST. gRPC integration remains for the next iteration.\nI started because I wanted to work with Kafka directly, and I gained more than that. Architecture design, concurrency control, service decomposition — encountering them together within a single system was a different experience from studying each one separately.\nReferences chat-services GitHub Repository Implementing Hexagonal Architecture in Go Kafka Fundamentals and KRaft Mode Spring WebFlux Fundamentals — Non-blocking I/O and the Reactive Stack HTTP/1.1 and HTTP/2 MongoDB vs Redis — Same NoSQL, Different Roles — Context behind the Redis → MongoDB transition ","permalink":"https://wid-blog.github.io/en/posts/career/personal/chat-services-retrospective/","summary":"A record of designing and building a chat system as a personal project to gain hands-on experience with Kafka and Hexagonal Architecture.","title":"chat-services"},{"content":"Spring MVC assigns a thread to each incoming request. That thread waits for DB query results, waits for external API responses. In I/O-bound workloads, most threads end up waiting. WebFlux changes this structure with an event loop-based non-blocking model.\nSpring MVC\u0026rsquo;s Thread Model Spring MVC uses a thread-per-request model. When Tomcat receives a request, it pulls a thread from the pool and assigns it. That thread handles the entire lifecycle: controller → service → database access → response.\nThe problem is I/O wait time. Executing a JDBC query blocks the thread until results arrive. Calling an external API with RestTemplate does the same. 
The thread consumes resources while doing nothing.\nConcurrent request capacity is bound by thread pool size. Tomcat defaults to 200 threads. The 201st request queues until an earlier one completes. Throughput drops sharply when I/O waits grow long.\nWebFlux\u0026rsquo;s Event Loop Model WebFlux runs on a Netty-based event loop. When a thread receives a request, it delegates I/O operations to the OS and immediately moves to the next request. When I/O completes, a callback delivers the result and the remaining logic executes.\nThreads never wait, so a small number of threads handle many concurrent requests. Event loop threads equal to the CPU core count can manage thousands of simultaneous connections.\nThroughput increases for I/O-bound workloads. API gateways, OAuth2 authentication servers, and inter-service communication in microservice architectures are typical examples.\nCPU-bound tasks see no benefit. Running heavy computation on an event loop thread blocks other request processing. Image processing, encryption, and similar work require offloading to a separate thread pool.\nReactor The event loop model relies on callbacks. Nested callbacks increase code complexity. Project Reactor solves this with declarative pipelines. It provides two core types.\nMono: Returns 0 or 1 result asynchronously. Used for looking up a single user from the database or receiving a token from an external API.\nFlux: Returns 0 to N results as an asynchronous stream. 
Used for querying lists from the database or subscribing to real-time events.\n@GetMapping(\u0026#34;/{id}\u0026#34;) public Mono\u0026lt;User\u0026gt; getUser(@PathVariable String id) { return userRepository.findById(id); } @GetMapping public Flux\u0026lt;User\u0026gt; getAllUsers() { return userRepository.findAll(); } Reactor composes asynchronous pipelines through operator chaining.\npublic Mono\u0026lt;AuthToken\u0026gt; login(String code) { return oauth2Client.getToken(code) // Mono\u0026lt;TokenDto\u0026gt; .map(tokenMapper::toDomain) // Mono\u0026lt;Token\u0026gt; .flatMap(userRepository::upsertUser) // Mono\u0026lt;User\u0026gt; .map(authService::generateToken); // Mono\u0026lt;AuthToken\u0026gt; } map performs synchronous transformation; flatMap performs asynchronous transformation. Use flatMap for operations that return another Mono or Flux, such as I/O calls.\nOne important characteristic: Reactor executes at subscription time. Creating a Mono or Flux does nothing on its own. The pipeline runs only when .subscribe() is called or when WebFlux returns it as a response.\nWebClient RestTemplate is blocking. Running a blocking call on an event loop thread occupies that thread and degrades overall throughput. In WebFlux environments, use WebClient — the non-blocking HTTP client.\npublic Mono\u0026lt;UserDto\u0026gt; getResource(String accessToken) { return webClient .get() .uri(uri -\u0026gt; uri.queryParam(\u0026#34;access_token\u0026#34;, accessToken).build()) .retrieve() .onStatus(HttpStatusCode::is4xxClientError, this::handleError) .bodyToMono(UserDto.class); } retrieve() receives the response; bodyToMono() deserializes it. The entire process is non-blocking. onStatus() defines error handling declaratively by HTTP status.\nWebClient works in Spring MVC too. 
RestTemplate entered maintenance mode in Spring 5, and Spring recommends WebClient as its replacement.\nReactive Data Access The reactive stack delivers its full benefit when the entire pipeline is non-blocking. If the controller and service are non-blocking but database access blocks, the event loop thread stalls.\nMongoDB: Spring Data provides ReactiveMongoRepository. All CRUD methods return Mono or Flux.\ninterface UserDao extends ReactiveMongoRepository\u0026lt;UserEntity, String\u0026gt; {} Relational databases: Use R2DBC (Reactive Relational Database Connectivity), the reactive alternative to JDBC. Drivers for MySQL, PostgreSQL, and others are available.\nJPA is blocking. Hibernate and JDBC block threads, so using JPA with WebFlux eliminates non-blocking benefits. Choose R2DBC for relational databases or Reactive MongoDB for document stores.\nWhen to Choose Which WebFlux and Spring MVC are not mutually exclusive. Spring designed them to coexist in the same project. In practice, unifying the entire stack under one model works better.\nWebFlux fits when:\nI/O-bound workloads dominate High concurrency is required (API gateways, authentication servers) Inter-service communication is frequent The data store has reactive drivers (MongoDB, Redis) Spring MVC fits when:\nJDBC/JPA-based relational databases are central CPU-bound tasks dominate The team is unfamiliar with reactive programming Spring MVC assigns one thread per request. When I/O grows, threads sit idle. WebFlux removes that wait time with an event loop. Which model fits comes down to whether the workload is I/O-bound or CPU-bound, and how much concurrency it needs.\n","permalink":"https://wid-blog.github.io/en/posts/tech/language/spring-webflux-reactive-stack/","summary":"Spring MVC assigns one thread per request. When I/O waits pile up, threads sit idle. WebFlux replaces this with an event loop-based non-blocking model. 
A summary of the structural differences from MVC, the Reactor pattern, and when to choose which.","title":"Spring WebFlux Fundamentals — Non-blocking I/O and the Reactive Stack"},{"content":"HTTP/1.1 processes requests and responses sequentially. One connection handles one request at a time. When multiple resources are needed simultaneously, multiple connections must be opened. HTTP/2 solved this limitation by processing multiple requests in parallel over a single connection.\nHTTP/1.1 Limitations HOL Blocking HTTP/1.1 processes request-response pairs sequentially on a single TCP connection. The second request waits until the first response arrives. This is HOL blocking — Head-of-Line Blocking.\nsequenceDiagram participant C as Client participant S as Server Note over C,S: HTTP/1.1 — Sequential C-\u003e\u003eS: GET /style.css S--\u003e\u003eC: style.css response C-\u003e\u003eS: GET /script.js S--\u003e\u003eC: script.js response C-\u003e\u003eS: GET /image.png S--\u003e\u003eC: image.png response Rendering a single web page requires dozens of resources. Sequential processing is slow.\nHTTP/1.1 introduced pipelining as a workaround — sending multiple requests without waiting for responses. But responses must still arrive in request order. A slow first response delays everything behind it. Implementation issues led most browsers to disable pipelining.\nIn practice, concurrent TCP connections are the real workaround. Browsers open up to 6 simultaneous TCP connections per domain for parallel requests. But each connection incurs TCP handshake and TLS handshake costs.\nHeader Redundancy HTTP/1.1 transmits headers as text. Headers like User-Agent, Cookie, and Accept repeat with every request. With cookies included, request headers alone can reach several KB. No compression mechanism exists, so identical information transmits every time.\nHTTP/2 HTTP/2 was standardized in 2015, based on Google\u0026rsquo;s SPDY protocol. 
It maintains the same HTTP semantics while fundamentally changing the transport mechanism.\nBinary Framing HTTP/1.1 is text-based — requests and responses are human-readable strings. HTTP/2 encodes all messages as binary frames.\nblock-beta columns 3 block:http1[\"HTTP/1.1\"]:3 h1[\"GET /index.html HTTP/1.1\\nHost: example.com\\nAccept: text/html\"] end space:3 block:http2[\"HTTP/2\"]:3 h2a[\"HEADERS frame\\n(stream 1)\"] h2b[\"DATA frame\\n(stream 1)\"] h2c[\"HEADERS frame\\n(stream 3)\"] end style http1 fill:#FFCDD2 style http2 fill:#C8E6C9 A single HTTP message splits into HEADERS frames and DATA frames. Each frame carries a stream ID tag, allowing the receiver to reassemble frames into the correct messages.\nMultiplexing This is the core improvement in HTTP/2. Multiple streams operate simultaneously over a single TCP connection. Each stream is an independent request-response pair. Frames interleave at the frame level, so a slow stream does not block others.\nsequenceDiagram participant C as Client participant S as Server Note over C,S: HTTP/2 — Multiplexing C-\u003e\u003eS: Stream 1: GET /style.css C-\u003e\u003eS: Stream 3: GET /script.js C-\u003e\u003eS: Stream 5: GET /image.png S--\u003e\u003eC: Stream 3: script.js (partial) S--\u003e\u003eC: Stream 1: style.css (complete) S--\u003e\u003eC: Stream 5: image.png (partial) S--\u003e\u003eC: Stream 3: script.js (complete) S--\u003e\u003eC: Stream 5: image.png (complete) No need to open multiple TCP connections as in HTTP/1.1. A single connection handles all requests. TCP handshake and TLS handshake costs occur just once.\nHPACK Header Compression HTTP/2 compresses headers with HPACK, a dedicated compression algorithm combining two techniques.\nStatic/dynamic tables: Frequently used header fields are replaced with index numbers. :method: GET is entry 2 in the static table, so a single byte is enough to encode the whole field. 
Headers appearing during the connection are added to the dynamic table and referenced by index in subsequent requests.\nHuffman encoding: Values not in the tables are compressed using Huffman coding. High-frequency characters receive shorter bit representations, reducing overall size.\nHeaders that repeated at several KB per request in HTTP/1.1 shrink to tens of bytes.\nServer Push When a client requests HTML, the server can proactively send CSS and JavaScript that the HTML needs — without waiting for additional client requests.\nsequenceDiagram participant C as Client participant S as Server C-\u003e\u003eS: GET /index.html S--\u003e\u003eC: PUSH_PROMISE: /style.css S--\u003e\u003eC: PUSH_PROMISE: /script.js S--\u003e\u003eC: /index.html response S--\u003e\u003eC: /style.css response (pushed) S--\u003e\u003eC: /script.js response (pushed) This eliminates the round-trip time for the client to parse HTML and request additional resources. In practice, pushing cached resources unnecessarily limits its usefulness.\nStream Prioritization Clients can assign weights and dependencies to each stream. Setting a CSS file\u0026rsquo;s priority higher than images causes the server to transmit CSS first. This improves initial load time by prioritizing render-critical resources.\nHTTP/2 Limitations HTTP/2 solved HTTP-level HOL blocking but TCP-level HOL blocking remains. All streams operate over a single TCP connection, so when a TCP packet is lost, all streams wait until that packet is retransmitted.\nHTTP/3 addresses this by using the UDP-based QUIC protocol instead of TCP. Each stream becomes an independent transport unit, so packet loss in one stream does not affect others.\ngRPC gRPC is an RPC framework developed by Google. It uses HTTP/2 as the transport protocol and Protocol Buffers as the serialization format.\nHTTP/2 Utilization gRPC leverages HTTP/2 multiplexing to process multiple RPC calls in parallel over a single connection. HPACK header compression applies as well. 
Extending HTTP/2\u0026rsquo;s streaming capability, gRPC supports four communication patterns.\nblock-beta columns 2 block:unary[\"Unary\"]:1 columns 3 u1[\"Client\"] space u2[\"Server\"] u1 -- \"1 req / 1 res\" --\u003e u2 end block:server[\"Server Streaming\"]:1 columns 3 s1[\"Client\"] space s2[\"Server\"] s1 -- \"1 req / N res\" --\u003e s2 end block:client[\"Client Streaming\"]:1 columns 3 c1[\"Client\"] space c2[\"Server\"] c1 -- \"N req / 1 res\" --\u003e c2 end block:bidi[\"Bidirectional Streaming\"]:1 columns 3 b1[\"Client\"] space b2[\"Server\"] b1 -- \"N req / N res\" --\u003e b2 end style unary fill:#E3F2FD style server fill:#E8F5E9 style client fill:#FFF3E0 style bidi fill:#F3E5F5 Protocol Buffers Unlike REST\u0026rsquo;s JSON, gRPC uses Protocol Buffers (Protobuf). Define message structures and service interfaces in a .proto file, and code for each language is auto-generated.\nservice ChatService { rpc SendMessage (MessageRequest) returns (MessageResponse); rpc StreamMessages (RoomRequest) returns (stream Message); } message MessageRequest { string room_id = 1; string user_id = 2; string content = 3; } Protobuf uses binary serialization. Messages are often cited as 3–5x smaller than the equivalent JSON, and serialization and deserialization as 5–10x faster, though the actual gain depends on the payload. 
The schema in the .proto file enforces the interface contract between client and server at the code level.\nREST vs gRPC gRPC fits when:\nInternal microservice communication: low latency, high throughput needed Bidirectional streaming: real-time chat, event subscriptions Polyglot environments: one .proto file generates client/server code for multiple languages REST fits when:\nExternal APIs: browsers do not support gRPC directly — gRPC-Web or a proxy is required Debugging convenience: JSON is human-readable, inspectable via curl or browser dev tools Simple CRUD: resource-based APIs that do not need RPC-style interfaces In practice, combining gRPC for internal service communication and REST for external APIs is a common pattern.\nHTTP/1.1 processes requests and responses sequentially. HTTP/2 removed this constraint with multiplexing. gRPC adds binary serialization and streaming on top of HTTP/2\u0026rsquo;s advantages. Each protocol addresses different problems, and the workload determines the choice.\n","permalink":"https://wid-blog.github.io/en/posts/tech/network/http1-vs-http2/","summary":"HTTP/1.1 processes requests and responses sequentially. HTTP/2 changed this with multiplexing, binary framing, and header compression. A summary of the differences between the two protocols and gRPC, which runs on top of HTTP/2.","title":"HTTP/1.1 and HTTP/2"},{"content":"A container packages an application with its execution environment to run identically anywhere.\nContainer Process Isolation A container is an isolated process running on the host OS. There is no guest OS. It shares the host kernel while isolating the process tree, network, and filesystem.\nNamespaces partition process lists, network interfaces, and filesystems into independent spaces. A process inside a container cannot see other containers.\ncgroups (Control Groups) cap CPU, memory, and disk I/O usage. 
Without them, a single container could consume all host resources.\nComparison with VMs Process isolation through namespaces and cgroups differs fundamentally from VMs.\nA VM installs a full guest OS on top of a hypervisor. This provides hardware-level isolation, but including an entire OS makes images large and startup slow.\nA container shares the host kernel and isolates at the process level only. No guest OS means lighter images and faster startup. The isolation level is lower than a VM, but sufficient for most deployment scenarios.\nflowchart TB subgraph vm[\"VM\"] direction TB HW1[\"Hardware\"] --\u003e HV[\"Hypervisor\"] HV --\u003e G1[\"Guest OS + App\"] HV --\u003e G2[\"Guest OS + App\"] end subgraph ct[\"Container\"] direction TB HW2[\"Hardware\"] --\u003e OS[\"Host OS\\nShared Kernel\"] OS --\u003e CR[\"Container Runtime\"] CR --\u003e C1[\"App\"] CR --\u003e C2[\"App\"] end Docker Architecture Docker operates as a client-server architecture.\nflowchart LR CLI[\"Docker CLI\"] --\u003e|REST API| D[\"Docker Daemon\\ndockerd\"] D --\u003e CTD[\"containerd\"] CTD --\u003e RUNC[\"runc\"] RUNC --\u003e C1[\"Container\"] RUNC --\u003e C2[\"Container\"] Running docker run or docker build sends the command from the Docker CLI to the Docker Daemon (dockerd) via REST API. The Daemon is the core process managing container lifecycles.\nTwo more layers exist below that. containerd handles container execution, image management, and storage. runc configures namespaces and cgroups to start the container process. runc implements the OCI (Open Container Initiative) standard.\nBecause the layers are separate, containers can run with containerd alone, without the Docker Daemon. Kubernetes dropped its Docker dependency for this reason.\nCore Concepts Image Running a container requires an image. An image is a read-only template containing application code, runtime, libraries, and configuration files.\nImages use a layered structure. 
Each layer adds only the changes on top of the previous one. Multiple images sharing common layers reduces disk usage and build time.\nContainer A container is a running instance created from an image. It adds a writable layer on top of the image\u0026rsquo;s read-only layers. Multiple containers run independently from the same image.\nDeleting a container removes its writable layer. Persistent data requires volumes to store separately on the host.\nRegistry A registry stores and distributes images. Docker Hub is the most widely used public registry, and organizations often run private ones.\nStoring an image to a registry is a push; retrieving one is a pull.\nflowchart LR DF[\"Dockerfile\"] --\u003e|docker build| IMG[\"Image\"] IMG --\u003e|docker run| CT[\"Container\"] IMG --\u003e|docker push| REG[\"Registry\"] REG --\u003e|docker pull| IMG2[\"Image\"] Dockerfile Building an image requires a Dockerfile. It describes base image selection, application copying, dependency installation, and run commands in order.\nKey Instructions FROM node:20-alpine WORKDIR /app COPY package*.json ./ RUN npm ci --production COPY . . EXPOSE 3000 CMD [\u0026#34;node\u0026#34;, \u0026#34;server.js\u0026#34;] FROM specifies the base image. It is the starting point of every Dockerfile. WORKDIR sets the working directory, and COPY brings files from the host into the image.\nRUN executes commands during the build. It handles dependency installation and compilation. Each RUN instruction creates a new layer.\nEXPOSE documents the port the container listens on. Actual port binding happens with docker run -p. CMD sets the default command to run when the container starts.\nCMD vs ENTRYPOINT They look similar but serve different purposes.\nCMD is the default startup command. Passing arguments to docker run replaces CMD entirely. ENTRYPOINT fixes the command that always executes. 
Arguments from docker run append after ENTRYPOINT.\n# CMD — can be replaced by docker run arguments CMD [\u0026#34;node\u0026#34;, \u0026#34;server.js\u0026#34;] # ENTRYPOINT — always runs node; accepts additional arguments ENTRYPOINT [\u0026#34;node\u0026#34;] CMD [\u0026#34;server.js\u0026#34;] Using both together, ENTRYPOINT defines the executable and CMD provides default arguments. Running docker run \u0026lt;image\u0026gt; worker.js replaces only the CMD portion. A common pattern for distributing CLI tools as containers.\nMulti-stage Build Multiple FROM instructions in a single Dockerfile separate build stages. Build tools and source stay in the build stage; the final image contains only the artifacts needed for execution.\n# Build stage FROM golang:1.22 AS builder WORKDIR /app COPY . . RUN go build -o server . # Run stage FROM alpine:3.19 COPY --from=builder /app/server /server CMD [\u0026#34;/server\u0026#34;] The compiler and source are excluded from the final image, reducing its size. The effect is significant for compiled languages like Go.\nDocker Compose Real-world services run multiple containers together: a web server, a database, a cache. Managing each with individual docker run commands gets cumbersome.\nDocker Compose defines and manages multi-container applications in a single file.\nBasic Structure services: api: build: . ports: - \u0026#34;8080:3000\u0026#34; depends_on: - db - redis environment: DATABASE_URL: postgres://user:pass@db:5432/mydb db: image: postgres:16 volumes: - db-data:/var/lib/postgresql/data redis: image: redis:7-alpine volumes: db-data: services defines each container. build specifies a Dockerfile path; image uses an existing one.\nvolumes persists data outside the container. Data survives container deletion.\ndepends_on sets the startup order between services. 
It does not wait for a service to become \u0026ldquo;ready,\u0026rdquo; so the application needs its own retry logic.\nCommon Commands docker compose up starts all services. Add -d to run in the background. docker compose down removes containers and networks but preserves volumes. docker compose logs -f follows logs in real time.\nWrap-up A container is a process isolation technology. Docker manages it through images, containers, and registries. Building with Dockerfile and composing with Docker Compose maintains identical environments from development to deployment.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/docker-container-fundamentals/","summary":"Covers container concepts, the differences from VMs, Docker\u0026rsquo;s architecture, and the basics of Dockerfile and Docker Compose.","title":"Docker Container Fundamentals"},{"content":"How should a codebase be organized? Splitting by technical responsibility (Controller, Service, Repository) is horizontal slicing. Splitting by feature or domain (User, Order, Payment) is vertical slicing. The organizing principle determines the scope of changes, inter-team dependencies, and deployment boundaries.\nHorizontal Slicing Horizontal slicing separates code by technical responsibility into layers. Layered Architecture is the canonical example.\nsrc/ controllers/ UserController OrderController PaymentController services/ UserService OrderService PaymentService repositories/ UserRepository OrderRepository PaymentRepository Code with the same technical role lives in the same directory. Controllers with controllers, services with services.\nStrengths Technical concerns are clearly separated. HTTP handling lives only in the Controller layer. Data access lives only in the Repository layer. Swapping a layer is straightforward — replacing a REST API with gRPC only touches the Controller layer.\nThe barrier to entry is low. Most frameworks (Spring, Express, Django) default to this structure. 
New team members grasp the layout quickly.\nLimitations Changing \u0026ldquo;order cancellation\u0026rdquo; touches OrderController, OrderService, OrderRepository, and possibly PaymentService. The change spans multiple layers.\nAs features grow, each layer accumulates dozens of files. services/ ends up with UserService, OrderService, PaymentService, NotificationService, InventoryService, ShippingService side by side. Same technical layer, entirely different domain contexts.\nVertical Slicing Vertical slicing separates code by feature or domain. Each slice contains all the layers it needs.\nsrc/ user/ UserController UserService UserRepository order/ OrderController OrderService OrderRepository payment/ PaymentController PaymentService PaymentRepository Everything related to \u0026ldquo;orders\u0026rdquo; lives in order/. Modifying order features does not require touching other directories.\nVertical Slice Architecture General vertical slicing divides code by domain module (User, Order, Payment). Jimmy Bogard\u0026rsquo;s Vertical Slice Architecture goes further, splitting by individual use case rather than module.\nsrc/ features/ CreateOrder/ CreateOrderHandler CreateOrderRequest CreateOrderValidator CancelOrder/ CancelOrderHandler CancelOrderRequest GetOrderDetail/ GetOrderDetailHandler GetOrderDetailQuery Each use case is an independent slice. CreateOrder and CancelOrder belong to the same \u0026ldquo;order\u0026rdquo; domain but exist as separate slices. Within a single slice, the entire flow from request handling to data access is self-contained.\nStrengths Feature independence is high. Modifying one slice does not affect others. Code review scope narrows, and merge conflicts decrease. This structure suits multiple teams working independently in the same codebase.\nChange cohesion is high. A single feature change completes within a single directory, so the files involved in that change sit side by side.\nLimitations Shared code management is tricky. 
Cross-cutting concerns like authentication, logging, and transaction handling are needed identically across multiple slices. Copying them into each slice creates duplication. Extracting shared modules introduces inter-slice dependencies, weakening independence. Judging the line between \u0026ldquo;acceptable duplication\u0026rdquo; and \u0026ldquo;excessive duplication\u0026rdquo; is necessary.\nConsistency maintenance has a cost. As slices change independently, coding styles, error handling, and logging patterns may diverge across slices. Team conventions and code reviews keep them aligned.\nSelection Criteria This is not about choosing one over the other. The appropriate direction depends on the situation.\nEarly-stage projects with a small domain suit horizontal slicing. Few features mean few files per layer, and the framework\u0026rsquo;s default structure can be used as-is. The cost of setting up the architecture is low.\nAs features multiply and teams grow, vertical slicing\u0026rsquo;s advantages emerge. Independent work per feature becomes possible, and the scope of each change shrinks. When transitioning to microservices, vertically sliced code maps naturally to service boundaries.\nHybrid approaches are common. The top level is split vertically by domain, and within each domain, code is organized horizontally by technical layer.\nsrc/ order/ controller/ service/ repository/ payment/ controller/ service/ repository/ Vertical slicing defines domain boundaries. Horizontal slicing organizes code within each domain. A structure frequently encountered in practice.\nNeither approach is universally better — the appropriate direction depends on project scale and team structure. Horizontal slicing works well when starting small. 
As features multiply and teams grow, defining domain boundaries through vertical slicing is a natural progression.\n","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/horizontal-vertical-slicing/","summary":"The difference between splitting code by technical layers (horizontal) and by features or domains (vertical). Trade-offs and selection criteria for each approach.","title":"Horizontal vs Vertical Slicing"},{"content":"The legacy ad server had to go.\nIt served ads when the primary ad system went down. It also handled dashboard previews, external platform integrations, and test requests. The codebase was old — maintenance costs kept climbing.\nRemoving the legacy meant building something to take over its role. I designed a dedicated fallback server for failure recovery.\nRequirements The conditions were clear.\nThe fallback server receives no traffic when the primary system runs normally. It activates only when the primary system determines it cannot serve ads. Reducing server costs during idle time was one of the goals.\nWhen a request arrives, the server must handle candidate retrieval, filtering, and response generation in a single pass. Multiple filters each reference independent data sources.\nComplex business logic concentrated in a single API endpoint.\nTechnology Stack I chose Nest.js. The team\u0026rsquo;s operational environment centered on Node.js/TypeScript, and Nest.js\u0026rsquo;s DI container and Module system seemed well-suited for organizing business logic at this scale.\nFor the HTTP server, I picked Fastify over Express. Throughput matters for an ad server, and Fastify delivers better performance with the same API surface.\nTypeORM handled the ORM layer. 
Ad data and recommendation data lived in separate databases, requiring multi-datasource connections — something TypeORM supports natively.\nThe design philosophy and core concepts of Nest.js are covered in Nest.js Fundamentals — DI and Module System.\nArchitecture Decision Nest.js defaults to vertical module slicing. Each domain gets its own module containing a Controller, Service, and Repository. A user module, a product module, an order module.\nThis approach did not fit the project.\nDomain boundaries were unclear. The server had ad retrieval, filtering, and user parsing, but these were not independent domains. Every request followed the same flow, and every feature contributed sequentially to producing a single response.\nI chose a horizontal layered architecture instead. Layers divided by technical responsibility.\nPresentation: HTTP request/response handling. Controllers, Interceptors, and Filters belong here. Application: Business logic. Orchestrates multiple services and transforms data. Interfaces for infrastructure dependencies are defined in this layer. Domain: Business entities. Pure models with no technical dependencies. Infra: External system integration. Database access, Redis cache, and external API call implementations belong here. To maintain dependency direction, I applied DIP. The Application layer defines interfaces; the Infra layer provides implementations. Nest.js\u0026rsquo;s DI container manages these connections through Symbol-based tokens.\nResults and Lessons Validation came when replacing an external component. Swapping an internal implementation for an external library required changes only in the Infra layer. The Application layer\u0026rsquo;s interfaces remained untouched. Layer separation produced real flexibility.\nA colleague suggested adopting vertical module slicing to follow Nest.js conventions. A reasonable suggestion. But this project was not a multi-domain service — it was a single API executing a complex pipeline. 
Horizontal layering fit that characteristic better.\nService class bloat was a real issue. The filtering service accumulated many filter combinations, making the code lengthy. I addressed this by exposing a Facade to reduce the external interface and splitting internal services into smaller units.\nThe legacy server had to go. So I built a new one. For a single API with many filters in one pipeline — the framework\u0026rsquo;s default structure was not the answer.\nReferences Nest.js Fundamentals — DI and Module System Layered Architecture and Dependency Inversion ","permalink":"https://wid-blog.github.io/en/posts/career/dable/ad-fallback-server-retrospective/","summary":"Designing a Nest.js-based fallback server while removing a legacy ad server. Why a horizontal layered architecture fit better than Nest.js\u0026rsquo;s default vertical module slicing for a single API with complex business logic.","title":"Ad Fallback Server Design Retrospective"},{"content":"As Express projects grow, dependency management falls on the developer. Creating service objects, passing them where needed, tracking their lifecycle — all manual work. Nest.js solves this at the framework level. It provides a DI container and Module system that give applications structure.\nIoC and DI Nest.js builds on an IoC Container and Dependency Injection. Registering a class as a Provider via the @Injectable() decorator is enough — declaring that type in another class\u0026rsquo;s constructor causes the instance to be injected automatically.\n@Injectable() class OrderService { constructor( private readonly productService: ProductService, private readonly paymentService: PaymentService, ) {} } Developers do not instantiate with new. The IoC Container finds the type among registered Providers and hands it to the constructor. The hierarchy of DIP, IoC, and DI itself is covered in the dependency-injection post.\nModule System A Nest.js application consists of Modules. 
The @Module decorator accepts four metadata properties.\nimports: Other modules this module depends on providers: Services, repositories, and other injectables this module supplies controllers: Controllers handling HTTP requests exports: Providers exposed to other modules @Module({ imports: [TypeOrmModule.forFeature([ProductEntity])], providers: [OrderService, ProductService], controllers: [OrderController], exports: [OrderService], }) class OrderModule {} Module boundaries enforce encapsulation. Without these boundaries, the entire project becomes one tangled dependency graph. Providers not listed in exports remain inaccessible from outside. If OrderModule does not export ProductService, other modules cannot use it directly. Only OrderService is exposed.\nGlobal Module Some Providers need to be available everywhere: loggers, configuration stores, error trackers. The @Global decorator makes a module\u0026rsquo;s exports accessible application-wide after a single registration.\n@Global() @Module({ providers: [Logger, ConfigStore], exports: [Logger, ConfigStore], }) class SharedModule {} Dynamic Module Some modules change behavior based on configuration. TypeORM\u0026rsquo;s database connection is a common example.\nTypeOrmModule.forRoot({ type: \u0026#34;mysql\u0026#34;, host: \u0026#34;localhost\u0026#34;, }); forRoot registers global configuration in the root module. forFeature registers specific entities in consuming modules. The same module class gets reused with different configurations.\nProvider A Provider is anything injectable in Nest.js. Services, repositories, factories all qualify. The @Injectable() decorator registers a class with the IoC container. Multiple registration methods enable flexible implementation swapping.\nRegistration Methods Four ways to register a Provider in a Module.\nuseClass: Registers a class as the Provider.\n{ provide: OrderRepository, useClass: OrderRepositoryImpl } useFactory: Creates a Provider through a factory function. 
Can inject other Providers for dynamic instantiation.\n{ provide: CACHE_CLIENT, useFactory: (config: ConfigStore) =\u0026gt; CacheClient.from(config), inject: [ConfigStore], } useValue: Registers an already-created value as a Provider.\nuseExisting: Creates an alias for an existing Provider.\nCustom Provider and Symbol Tokens TypeScript interfaces do not exist at runtime. They cannot serve as DI tokens. Symbols solve this problem.\nconst ORDER_REPOSITORY = Symbol(\u0026#34;ORDER_REPOSITORY\u0026#34;); interface OrderRepository { fetchById(orderId: string): Promise\u0026lt;Order | null\u0026gt;; } The module registers the Symbol as the token and the implementation as the Provider.\n{ provide: ORDER_REPOSITORY, useClass: OrderRepositoryImpl } The consumer specifies the token with the @Inject decorator.\nconstructor(@Inject(ORDER_REPOSITORY) private readonly repo: OrderRepository) {} This pattern implements the Dependency Inversion Principle. Business logic depends only on interfaces. Implementations swap through Module configuration.\nScope Controls Provider lifecycle.\nDEFAULT: Singleton. One instance shared across the entire application. REQUEST: New instance per request. TRANSIENT: New instance per injection. DEFAULT (singleton) fits most cases. REQUEST scope applies only when per-request state is needed.\nController and Request Pipeline Controllers receive HTTP requests and delegate to services. They handle routing and HTTP concerns only.\n@Controller(\u0026#34;/orders\u0026#34;) class OrderController { constructor(private readonly orderService: OrderService) {} @Post() public async create(@Body() request: CreateOrderDto): Promise\u0026lt;OrderDto\u0026gt; { return this.orderService.create(request); } } Nest.js provides four tools for intercepting request processing.\nGuard: Authentication and authorization checks. Runs before the request reaches the Controller. Interceptor: Request/response transformation, logging, metric collection. 
Separates cross-cutting concerns through the AOP pattern. Pipe: Input data transformation and validation. Filter: Exception handling. Converts errors into appropriate HTTP responses. Global registration applies them to all requests.\n@Module({ providers: [ { provide: APP_INTERCEPTOR, useClass: MetricInterceptor }, { provide: APP_FILTER, useClass: DefaultExceptionFilter }, ], }) class AppModule {} Practical Patterns Repository Pattern and DIP The Symbol token pattern from earlier applies directly to the data access layer as the Repository pattern. Abstracts data access. Define interfaces in the business logic layer; provide implementations in the infrastructure layer.\n// Interface (business layer) const PRODUCT_REPOSITORY = Symbol(\u0026#34;PRODUCT_REPOSITORY\u0026#34;); interface ProductRepository { fetchById(productId: string): Promise\u0026lt;Product | null\u0026gt;; } // Implementation (infrastructure layer) @Injectable() class ProductRepositoryImpl implements ProductRepository { constructor( @InjectRepository(ProductEntity) private readonly repository: Repository\u0026lt;ProductEntity\u0026gt;, ) {} public async fetchById(productId: string): Promise\u0026lt;Product | null\u0026gt; { return this.repository.findOneBy({ id: productId }); } } The module connects token and implementation.\n{ provide: PRODUCT_REPOSITORY, useClass: ProductRepositoryImpl } Swapping databases or injecting mocks for tests requires only a Module configuration change. Business logic stays untouched.\nEncapsulation Through Module Boundaries Use exports to expose only a Facade. Internal implementations stay hidden.\n@Module({ providers: [OrderService, ProductService, ShippingService], exports: [OrderService], }) class OrderServiceModule {} Other modules can use only OrderService. ProductService and ShippingService remain internal to OrderServiceModule.\nAs Express projects grow, dependency management becomes the developer\u0026rsquo;s burden. Nest.js addresses this at the framework level. 
It controls dependency direction, enforces encapsulation through module boundaries, and makes implementations swappable through the Provider pattern.\nReferences Dependency Injection — The Hierarchy of DIP, IoC, and DI — The abstraction levels of IoC/DI that Nest.js implements, and the DIP principle they sit beneath. ","permalink":"https://wid-blog.github.io/en/posts/tech/language/nestjs-di-module-system/","summary":"Nest.js provides a DI container and Module system at the framework level in the Node.js ecosystem. A summary of its core design principles: IoC, DI, Module, and Provider.","title":"Nest.js Fundamentals — DI and Module System"},{"content":"As code grows, separation of concerns becomes necessary. When HTTP request handling, business logic, and database access mix in a single class, the blast radius of changes becomes unpredictable. Layered architecture solves this by separating code into horizontal layers based on technical responsibility.\nThe core is not layer separation itself but controlling the direction of dependencies.\nTraditional Three-Layer Structure The most basic form consists of Presentation, Business Logic, and Data Access layers. It mirrors the flow of receiving a web request, executing business logic, and persisting to a database.\nDependencies flow downward. Presentation calls Business Logic; Business Logic calls Data Access.\nThe problem: business logic depends directly on data access technology. Switching from MySQL to MongoDB or changing an external API integration requires modifying business logic. Core rules remain unchanged, yet technical choices force changes to the heart of the application.\nFour-Layer Structure To address this, layers are further refined and dependency direction is redesigned.\nPresentation: Handles HTTP request/response processing. Routing, request parsing, and response serialization belong here. Controllers, interceptors, and exception filters live in this layer.\nApplication: Orchestrates business logic. 
Calls multiple services, transforms data, and defines transaction boundaries. This layer serves as the entry point for use cases. Interfaces for infrastructure dependencies are also defined here.\nDomain: Contains business entities and rules. Pure models with no technical dependencies. Changes in other layers do not affect this layer.\nInfra: Handles technical implementations. Database access, external API calls, messaging, and caching all belong here. Implements interfaces defined in the Application layer.\nDependency Direction and DIP In the three-layer structure, dependencies flow Presentation → Business → Data, always downward. The four-layer structure inverts this direction.\nThe core principle is the Dependency Inversion Principle, DIP. Upper layers do not depend on lower layers directly. Instead, upper layers define interfaces, and lower layers implement them.\nPresentation → Application → Domain ↑ Infra (implements interfaces) The Application layer knows only that \u0026ldquo;an order gets saved.\u0026rdquo; Whether that save targets MySQL or MongoDB remains unknown. The Infra layer implements the interface, and a DI container connects the two.\nIn this structure, Domain depends on nothing. It is the most stable layer. Infra depends on interfaces defined by Application. 
Dependency arrows converge toward the core business logic.\nCode Example Define interfaces in the Application layer; place implementations in the Infra layer.\n// application/abstraction/order-repository.ts interface OrderRepository { save(order: Order): Promise\u0026lt;Order\u0026gt;; findById(id: string): Promise\u0026lt;Order | null\u0026gt;; } // infra/persistence/order-repository-impl.ts class OrderRepositoryImpl implements OrderRepository { constructor(private readonly db: Database) {} async save(order: Order): Promise\u0026lt;Order\u0026gt; { const entity = OrderEntity.from(order); await this.db.save(entity); return entity.toDomain(); } async findById(id: string): Promise\u0026lt;Order | null\u0026gt; { const entity = await this.db.findById(id); return entity?.toDomain() ?? null; } } The Application layer service depends only on the interface.\n// application/service/order-service.ts class OrderService { constructor(private readonly orderRepository: OrderRepository) {} async createOrder(request: CreateOrderDto): Promise\u0026lt;OrderDto\u0026gt; { const order = Order.create(request); const saved = await this.orderRepository.save(order); return OrderDto.from(saved); } } Swapping databases or injecting mocks for tests requires changing only the Infra layer implementation. OrderService stays untouched.\nWhen It Fits Layered architecture suits services where business logic concentrates in a single domain. When one API executes a complex processing pipeline, layer-by-layer responsibility separation pays off.\nIt also works well in environments with frequent technology changes. Replacing a filtering engine from an internal implementation to an external library, or switching databases, requires modifying only the Infra layer. The Application layer\u0026rsquo;s interfaces block change propagation.\nLimitations For services with multiple domains, it may not be the best fit. 
When users, products, and orders each form independent domains, Vertical Slice Architecture by domain achieves higher cohesion than horizontal layering.\nService class bloat is another concern. When many use cases accumulate in a single Application service, the code grows long. Splitting services by use case or applying the Facade pattern to reduce the external interface helps.\nDTO conversion costs between layers also arise. Separate Presentation DTOs, Application DTOs, Domain entities, and Infra entities mean more conversion code. The benefits of layer boundaries must be balanced against conversion overhead.\nAs code grows, separation of concerns becomes necessary. Layered architecture performs that separation by technical responsibility. Which direction dependencies point matters more than the layers themselves.\n","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/layered-architecture/","summary":"Layered architecture separates code into horizontal layers by technical responsibility. A summary of the four-layer structure, dependency direction rules, and how DIP decouples layers.","title":"Layered Architecture and Dependency Inversion"},{"content":"Kafka is a distributed event streaming platform. It provides a structure for publishing and subscribing to large volumes of events in real time. It serves real-time data pipelines, event-driven architectures, log aggregation, and more.\nTopics and Partitions Topics In Kafka, messages are published to a Topic. A topic is a logical category of messages. Topics are created per event type: order-events, user-signups, and so on.\nA topic is an append-only log that stores messages. Once written, messages are immutable. They are deleted when the retention period expires.\nPartitions A single topic is divided into multiple Partitions. 
Partitions are the core unit that provides both parallelism and ordering guarantees.\nflowchart LR subgraph Topic[\"Topic: order-events\"] P0[\"Partition 0: msg0, msg3, msg6...\"] P1[\"Partition 1: msg1, msg4, msg7...\"] P2[\"Partition 2: msg2, msg5, msg8...\"] end Messages within a partition maintain order. Across partitions, no ordering is guaranteed. Messages with the same key are assigned to the same partition, ensuring event ordering for a specific entity (e.g., a particular order).\nIncreasing partition count increases throughput, since consumers can process different partitions in parallel.\nOffset Each message within a partition has a unique Offset number, starting from 0 and incrementing sequentially. Offsets serve as the reference point for tracking \u0026ldquo;how far a consumer has read.\u0026rdquo;\nProducer A Producer publishes messages to a topic.\nWhen a producer sends a message, it must decide which partition to target.\nKey-based partitioning. When a message has a key, a hash of the key determines the partition. The same key always maps to the same partition as long as the partition count does not change. This is used when event ordering for a specific user or order is required.\nRound robin. Without a key, messages are spread across partitions. Older clients rotate message by message; since Kafka 2.4 the default Java producer fills a batch for one partition before moving to the next (sticky partitioning). Suitable when ordering is unnecessary and even load distribution is desired.\nCustom partitioner. Custom partitioning logic can be implemented. Used when specific business rules dictate partition selection.\nAcks The producer can configure the level of acknowledgment required from brokers.\nacks=0: No acknowledgment. Fastest, but messages can be lost. acks=1: Leader broker acknowledges after writing. Messages can still be lost if the leader fails before replication. acks=all: All ISR (In-Sync Replicas) acknowledge. Safest, but increases latency. Consumer A Consumer reads messages from a topic. 
Unlike the producer\u0026rsquo;s \u0026ldquo;push,\u0026rdquo; consumers \u0026ldquo;pull\u0026rdquo; messages themselves, processing at their own pace.\nConsumers commit the offset of messages they have read. Committed offsets are stored in an internal Kafka topic (__consumer_offsets). When a consumer restarts, it resumes from the last committed offset.\nConsumer Groups Multiple consumers can be grouped into a Consumer Group. Within the same group, each partition is assigned to exactly one consumer.\nflowchart LR subgraph Topic[\"Topic (3 Partitions)\"] P0[\"P0\"] P1[\"P1\"] P2[\"P2\"] end subgraph Group[\"Consumer Group A\"] C1[\"Consumer 1\"] C2[\"Consumer 2\"] C3[\"Consumer 3\"] end P0 --\u003e C1 P1 --\u003e C2 P2 --\u003e C3 If the number of consumers exceeds the number of partitions, the excess consumers remain idle. To increase throughput, increase the partition count first.\nWhen consumers join or leave a group, Rebalancing occurs — the process of reassigning partitions. During rebalancing, message processing for that group pauses temporarily.\nMultiple Consumer Groups Different consumer groups read the same topic independently, each managing its own offsets.\nflowchart LR subgraph Topic[\"Topic (3 Partitions)\"] P0[\"P0\"] P1[\"P1\"] P2[\"P2\"] end subgraph GA[\"Group A (Order Processing)\"] A1[\"Consumer A1\"] A2[\"Consumer A2\"] end subgraph GB[\"Group B (Analytics)\"] B1[\"Consumer B1\"] end P0 --\u003e A1 P1 --\u003e A2 P2 --\u003e A1 P0 --\u003e B1 P1 --\u003e B1 P2 --\u003e B1 Multiple consumer groups subscribing to a single topic is the pub/sub pattern. A common example: an order processing system and an analytics system independently consuming the same events.\nBrokers and Clusters Broker A Broker is a Kafka server instance. It receives messages, persists them to disk, and delivers them to consumers. Multiple brokers form a Cluster.\nEach partition is assigned to one broker as the Leader. 
Producers and consumers communicate with the leader broker.\nReplication Partitions are replicated across multiple brokers. If the leader fails, one of the followers is promoted to the new leader.\nflowchart TB subgraph Cluster[\"Kafka Cluster\"] subgraph B1[\"Broker 1\"] P0L[\"P0 (Leader)\"] P1F[\"P1 (Follower)\"] end subgraph B2[\"Broker 2\"] P0F[\"P0 (Follower)\"] P1L[\"P1 (Leader)\"] end subgraph B3[\"Broker 3\"] P0F2[\"P0 (Follower)\"] P1F2[\"P1 (Follower)\"] end end P0L -.-\u003e|replication| P0F P0L -.-\u003e|replication| P0F2 P1L -.-\u003e|replication| P1F P1L -.-\u003e|replication| P1F2 ISR, In-Sync Replicas, is the set of replicas synchronized with the leader. If a follower falls behind, it is removed from the ISR. With acks=all, writes are acknowledged only after all ISR replicas have recorded the message.\nmin.insync.replicas sets the minimum ISR count. With a replication factor of 3 and min ISR of 2, writes succeed even if one broker fails. If two brokers fail, writes are rejected to protect data consistency.\nZooKeeper and Its Limitations Before Kafka 3.3, ZooKeeper managed cluster metadata: broker lists, topic/partition configurations, controller election, and ACL information.\nThe ZooKeeper-based architecture had several problems.\nOperational overhead of a separate system. A ZooKeeper cluster (typically 3-5 nodes) must be operated alongside the Kafka cluster. Monitoring, upgrades, and incident response targets double.\nMetadata propagation bottleneck. Brokers fetch metadata from ZooKeeper, so as partition counts grow, metadata synchronization takes longer. This slows controller failover recovery in large clusters.\nDual consensus problem. ZooKeeper runs its own consensus algorithm (ZAB), while Kafka separately operates ISR-based replication. The two systems can temporarily fall out of sync.\nKRaft Mode KRaft, Kafka Raft, removes ZooKeeper and lets Kafka manage metadata internally. 
Production use became available in Kafka 3.3, and ZooKeeper mode was removed starting from 4.0.\nIn KRaft, some brokers take on the Controller role. Controller nodes use the Raft consensus algorithm to agree on a metadata log. Metadata is stored in an internal Kafka topic, eliminating the need for a separate system.\nKey changes from ZooKeeper mode:\nNo ZooKeeper cluster. The operational target reduces to Kafka alone. Metadata is managed as an event log. Brokers subscribe to the metadata log and maintain their own state. Propagation is faster than polling from ZooKeeper. Controller failover speeds up. The Raft protocol elects a new leader who takes over the metadata log. Summary Kafka\u0026rsquo;s core consists of topics, partitions, and consumer groups. Partitions provide parallelism and ordering guarantees. Consumer groups enable horizontal scaling. Broker replication ensures fault tolerance.\nKRaft mode removed ZooKeeper as an external dependency from this structure. Kafka now handles metadata consensus and management on its own.\n","permalink":"https://wid-blog.github.io/en/posts/tech/infra/kafka-fundamentals-kraft/","summary":"Core Kafka concepts (topics, partitions, consumer groups, replication) and the background behind KRaft mode, which removes the ZooKeeper dependency.","title":"Kafka Fundamentals and KRaft Mode"},{"content":"The core of Hexagonal Architecture (Ports \u0026amp; Adapters) is dependency direction control. It isolates all external dependencies behind interfaces (Ports) so that business logic never depends on frameworks or databases.\nGo\u0026rsquo;s implicit interfaces and package structure make this pattern a natural fit.\nHexagonal Architecture This pattern, proposed by Alistair Cockburn, divides an application into three areas.\nDomain. The core layer containing business rules. It depends on no external technology.\nPort. The interface between the application and the outside world. 
Two kinds exist:\nDriving port (inbound): Entry points from outside into the application. Defines what the application offers. Driven port (outbound): Interfaces through which the application requests external systems. Defines what the application needs. Adapter. The implementation of a Port. Driving adapters (HTTP handlers, gRPC handlers) receive external requests and call ports. Driven adapters (DB repositories, message brokers) implement port interfaces to communicate with external systems.\nDependencies always point inward: Adapter → Port → Domain. Domain knows nothing about Port, and Port knows nothing about Adapter.\nflowchart LR subgraph Adapter[\"Adapter\"] DA[\"Driving AdapterREST, gRPC\"] DRA[\"Driven AdapterDB, Kafka\"] end subgraph Port[\"Port\"] DP[\"Driving Port\"] DRP[\"Driven Port\"] end subgraph Core[\"Domain + Application\"] D[\"Entity\"] A[\"UseCase / Service\"] end DA --\u003e|calls| DP DP -.-\u003e|defines| A A --\u003e|uses| DRP DRP -.-\u003e|implements| DRA A --\u003e|contains| D Go Directory Structure A directory structure commonly used when applying Hexagonal Architecture in Go:\ninternal/ ├── domain/ │ ├── entity/ # Business entities │ └── service/ # Domain services ├── port/ │ ├── driving/ # Inbound interfaces │ └── driven/ # Outbound interfaces ├── application/ │ ├── usecase/ # Business operation units │ ├── dto/ # Data transfer objects │ └── mapper/ # entity ↔ dto conversion └── adapter/ ├── driving/ # REST handler, gRPC handler └── driven/ # DB repository, message broker The internal/ package prevents direct access from external modules, naturally encapsulating the application\u0026rsquo;s internals.\nPort Ports are defined as Go interfaces.\nDriving Port Entry points from outside into the application. 
Defining one interface per use case gives each interface a single responsibility.\n// port/driving/messenger.go type JoinRoomUseCase interface { Handle(ctx context.Context, req dto.JoinRequest) error } type SendMessageUseCase interface { Handle(ctx context.Context, req dto.SendRequest) error } Driven Port Interfaces through which the application requests external systems.\n// port/driven/message.go type MessageRepository interface { Create(ctx context.Context, message entity.Message) error FindByRoom(ctx context.Context, roomID string, cursor string, limit int) ([]entity.Message, error) } type MessageBroker interface { Publish(ctx context.Context, message entity.Message) error Subscribe(subscriber MessageSubscriber) } Implicit Interfaces Go interfaces are satisfied implicitly. If an adapter has the methods defined by a port interface, it satisfies that interface without any explicit declaration — no implements keyword like Java.\nThis characteristic suits Hexagonal Architecture well. A driven adapter can satisfy a driven port without naming it: matching method signatures are enough, so the implementation itself carries no declaration that ties it to the port. Dependencies stay separated at the code level too.\nTo guarantee interface compliance at compile time, a common convention exists:\nvar _ driven.MessageRepository = (*MongoMessageRepository)(nil) This single line verifies at compile time that MongoMessageRepository satisfies driven.MessageRepository. The assertion itself references the driven package, so the file that holds it does import the port package.\nAdapter Driving Adapter An HTTP handler is a typical driving adapter. 
It receives external requests and calls the driving port (use case).\n// adapter/driving/rest/handler.go type Handler struct { sendUseCase driving.SendMessageUseCase } func NewHandler(uc driving.SendMessageUseCase) *Handler { return \u0026amp;Handler{sendUseCase: uc} } func (h *Handler) Send(c *gin.Context) { var req dto.SendRequest if err := c.ShouldBindJSON(\u0026amp;req); err != nil { c.JSON(http.StatusBadRequest, gin.H{\u0026#34;error\u0026#34;: err.Error()}) return } if err := h.sendUseCase.Handle(c.Request.Context(), req); err != nil { c.JSON(http.StatusInternalServerError, gin.H{\u0026#34;error\u0026#34;: err.Error()}) return } c.Status(http.StatusOK) } The handler depends only on the driving port interface. It has no knowledge of which implementation actually runs.\nDriven Adapter A DB repository is a typical driven adapter. It implements the driven port interface.\n// adapter/driven/persistence/repository.go type MongoMessageRepository struct { collection *mongo.Collection } func NewMongoMessageRepository(db *mongo.Database) *MongoMessageRepository { return \u0026amp;MongoMessageRepository{ collection: db.Collection(\u0026#34;messages\u0026#34;), } } func (r *MongoMessageRepository) Create(ctx context.Context, message entity.Message) error { doc := orm.FromMessage(message) _, err := r.collection.InsertOne(ctx, doc) if err != nil { return fmt.Errorf(\u0026#34;insert message: %w\u0026#34;, err) } return nil } ORM models and domain entities use separate structs. orm.FromMessage() and ToDomain() methods handle conversion, keeping domain entities independent of the database schema.\nDomain and Application Entity Domain entities contain business rules. 
Fields are unexported (lowercase) with getter methods.\n// domain/entity/message.go type Message struct { id string roomID string userID string body string sentAt time.Time } func NewMessage(roomID, userID, body string) Message { return Message{ id: uuid.New().String(), roomID: roomID, userID: userID, body: body, sentAt: time.Now(), } } func (m Message) ID() string { return m.id } func (m Message) RoomID() string { return m.roomID } func (m Message) Body() string { return m.body } Unexported fields prevent direct external modification. Creation only happens through the NewMessage constructor, protecting domain invariants.\nUseCase A use case handles one business operation. It implements a driving port and depends on driven ports.\n// application/usecase/send.go type SendUseCase struct { repo driven.MessageRepository broker driven.MessageBroker } func NewSendUseCase(repo driven.MessageRepository, broker driven.MessageBroker) *SendUseCase { return \u0026amp;SendUseCase{repo: repo, broker: broker} } func (uc *SendUseCase) Handle(ctx context.Context, req dto.SendRequest) error { message := entity.NewMessage(req.RoomID, req.UserID, req.Body) if err := uc.repo.Create(ctx, message); err != nil { return fmt.Errorf(\u0026#34;save message: %w\u0026#34;, err) } if err := uc.broker.Publish(ctx, message); err != nil { return fmt.Errorf(\u0026#34;publish message: %w\u0026#34;, err) } return nil } The use case depends only on driven port interfaces. 
Whether the backing store is MongoDB or PostgreSQL, any implementation of MessageRepository can be swapped in.\nDependency Injection In Go, assembling dependencies directly in the main function without a DI framework is the common approach.\nfunc main() { // driven adapters db := mongodb.Connect(os.Getenv(\u0026#34;MONGO_URI\u0026#34;)) messageRepo := repository.NewMongoMessageRepository(db) broker := messaging.NewKafkaBroker(kafkaConfig) // use case (inject driven ports) sendUseCase := usecase.NewSendUseCase(messageRepo, broker) // driving adapter (inject driving port) handler := rest.NewHandler(sendUseCase) // start server server := rest.NewServer(handler) server.Run(\u0026#34;:8080\u0026#34;) } The dependency graph appears explicitly in one place. Tracing which implementation is injected into which interface requires nothing more than reading the code.\nIn Java/Spring, @Component and @Autowired let the framework inject dependencies automatically. In Go, this process is manual — but the dependency flow stays explicit and easy to trace.\nSummary Hexagonal Architecture implementation varies by language idiom. In Go, implicit interfaces, the internal package, and manual DI align well with this pattern. Define ports as interfaces, implement them in adapters, and assemble everything in main. Dependency direction appears directly in the code structure, no framework required.\n","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/go-hexagonal-architecture/","summary":"Core concepts of Hexagonal Architecture and its idiomatic implementation in Go using implicit interfaces and package structure for dependency direction control.","title":"Implementing Hexagonal Architecture in Go"},{"content":"When a constructor takes too many parameters, the call site grows hard to read. A call like new Pizza(true, false, true, false, true, \u0026quot;cheese\u0026quot;, 12) forces the reader back to the constructor definition to make sense of each argument. 
When some parameters are optional, a common workaround is to define several constructors with different parameter counts — the Telescoping Constructor anti-pattern. This is the backdrop for Builder.\nBuilder shows up when three limits arrive together: many parameters, some optional, and step-wise validation. With only one of the three, a simpler tool suffices — constructor overloading, a static factory method, setters. When all three coincide, Builder reads most clearly. If the language offers named and default parameters, however, the same limits ease and Builder\u0026rsquo;s role shrinks accordingly.\nThe Shared Intent Builder carries three intents.\nStep-wise creation — Objects are not built in one shot. They take shape through a sequence of method calls. Parameter validation — Consistency checks happen at the moment all parameters are in place (build()). With a constructor, validation logic scatters across each parameter. Immutable result — The final object emerges without setters, avoiding the partial-initialization state a setter-based bean creates. The three intents converging in one pattern is what defines Builder. Drop one and a simpler tool fits better.\nGoF Form vs Fluent Builder The GoF formulation uses four roles: Director, Builder, ConcreteBuilder, Product.\nDirector decides the creation sequence and calls methods on the Builder interface. ConcreteBuilder implements that interface to produce the actual Product. Hand the same Director a different ConcreteBuilder and a different Product comes out. Producing different representations through the same construction process is the core of the GoF form.\nIn practice, the fluent builder Joshua Bloch organized in Effective Java Item 2 shows up more often.\nPizza pizza = new Pizza.Builder() .size(12) .cheese(true) .pepperoni(true) .build(); Director is gone. Builder exposes setter-shaped methods in a fluent chain. 
When build() is invoked, every parameter is in place, validation runs, and an immutable Product is returned.\nThe GoF form has its strength in producing multiple representations through the same procedure. The fluent builder has its strength in readability when the single object has many parameters. The fluent form dominates in practice.\nLanguage Implementations Each language expresses the same intent differently.\nJava has the richest Builder ecosystem. Hand-written builders work fine, and Lombok\u0026rsquo;s @Builder automates the generation. @Builder.Default covers defaults; @Singular handles incremental additions to collections. The cost is a dependency on Lombok\u0026rsquo;s compile-time abstractions.\nPython rarely needs an explicit Builder. dataclass plus defaults already covers most of the same ground.\n@dataclass class Pizza: size: int cheese: bool = False pepperoni: bool = False The call site reads as Pizza(size=12, cheese=True). Many or optional parameters do not produce a Telescoping problem. Validation gathers in __post_init__. An explicit Builder class shows up only when validation is complex or fields depend on each other in steps.\nTypeScript is similar. An object literal with an interface fills the same role.\ninterface PizzaOptions { size: number; cheese?: boolean; pepperoni?: boolean; } const pizza = new Pizza({ size: 12, cheese: true }); Optional fields use ?, and the caller passes a named object. A fluent builder is possible, but the object literal is shorter and clearer.\nThe richer the language\u0026rsquo;s support for named and default parameters, the smaller the room an explicit Builder needs.\nConstructor, Static Factory Method, Builder Comparing the three tools by parameter count and optionality clarifies the choice.\nFew parameters (1–3) — Constructor. Adding another tool only adds noise. Medium count with a need for naming — Static factory method (Pizza.cheesePizza() and the like). 
Many parameters, some optional, with validation — Builder. Effective Java Item 2 suggests a builder once there are four or more parameters with some optional. In practice, a more conservative rule — reach for Builder only when constructors and static factory methods clearly fall short — preserves simplicity.\nThe three are not mutually exclusive. A class often exposes both a static factory method and a builder. Stream.builder() is a familiar example.\nWhen the Language Absorbs the Pattern Two of Builder\u0026rsquo;s three limits — parameter count and optionality — can be absorbed into the language itself.\nKotlin\u0026rsquo;s named parameter plus default value is the canonical case.\ndata class Pizza( val size: Int, val cheese: Boolean = false, val pepperoni: Boolean = false ) val pizza = Pizza(size = 12, cheese = true) The call site reads as cleanly as a fluent builder, in less code. The Telescoping Constructor problem disappears at the syntax level. Scala\u0026rsquo;s case classes and Swift\u0026rsquo;s named arguments follow the same direction.\nWhen validation is step-wise or depends on previous fields, Builder still has a place. But the simpler limits — parameter count and optionality — get resolved by named/default parameters, and Builder shrinks naturally.\nThis is one of the cases where a design pattern gets absorbed into language features over time. The pattern does not disappear; the problem it solved gets handled at the language level.\nConclusion Builder is the choice when a constructor\u0026rsquo;s three limits — parameter count, optionality, and step-wise validation — meet at once. With only one of the three, a simpler tool fits better.\nThe choice narrows down to two questions.\nAre there four or more parameters, with some optional, and does validation belong in one place? If so, Builder. Does the language offer named and default parameters richly? If so, Builder\u0026rsquo;s place narrows to the cases where step-wise validation truly matters. 
The GoF Director/Builder/Product form sees less use today; Effective Java\u0026rsquo;s fluent builder dominates. Pattern shapes evolve with the era and the language.\nReferences Factory — How Static Factory Method and Builder compare in the decision criteria Singleton — getInstance as another form of static factory method Joshua Bloch — Effective Java (3rd ed.), Item 2: Consider a builder when faced with many constructor parameters GoF — Design Patterns: Elements of Reusable Object-Oriented Software (1994) ","permalink":"https://wid-blog.github.io/en/posts/tech/design-pattern/builder/","summary":"Builder is the answer when three limits of constructors meet at once — many parameters, some optional, and step-wise validation. With fewer than all three, simpler tools suffice. When the language provides rich named/default parameters, the need for Builder shrinks as well.","title":"Builder"},{"content":"When object creation happens directly at every call site, the caller depends on a concrete class. Swapping the implementation requires touching every caller. Factory breaks that coupling — it concentrates creation responsibility in one place and lets callers depend only on the abstraction.\nThe name Factory points to three variants, not one pattern. Factory Method, Abstract Factory, and Static Factory Method. They get grouped together often, yet the intent and application differ. Three variants with different intent under a similar name.\nThe Shared Intent What the three variants share is separating creation from use.\nCode that creates objects directly means the caller knows a concrete class. A call like new MySQLConnection() carries the knowledge of MySQLConnection inside the caller, and replacing it with PostgreSQL forces every call site to change. Factory abstracts that call. The caller depends only on an interface like Connection, and what concrete implementation arrives is the Factory\u0026rsquo;s decision.\nThat is where the common ground ends. 
The variants part ways in how they split creation from use.\nFactory Method Factory Method delegates object creation to subclasses.\nA parent class defines an abstract method createProduct(), and subclasses implement it to fix the concrete class. The parent\u0026rsquo;s other methods use the result of that abstract method. The full call flow lives in the parent, with one decision delegated downward. It reads as a form of Template Method.\nThe defining trait is that it runs on inheritance. A new concrete class means a new subclass. It fits when the domain is stable and the extension point is clearly identified. It strains in languages where multiple inheritance is awkward, or when the extension point shifts often.\nThe JDK\u0026rsquo;s Collection.iterator() is canonical. The Collection interface defines Iterator creation abstractly, and implementations like ArrayList and HashSet return iterators that fit their internal structure.\nAbstract Factory Abstract Factory addresses consistent creation of a family of related objects.\nOne interface groups creation methods for several products. A UIFactory defines createButton(), createWindow(), and createScrollbar() together; MacUIFactory returns Mac-style components and WindowsUIFactory returns Windows-style ones. The caller gets a guarantee that objects within a family belong together.\nDB drivers follow the same pattern. DriverFactory groups createConnection(), createStatement(), and createResultSet(), and each DB-specific factory returns a consistent family.\nIt fits when products only make sense in pairs. Abstracting a single object\u0026rsquo;s creation through this pattern is overkill. And adding a new product to the family means changing every Factory implementation — a constraint that suits stable domains with well-defined product types.\nStatic Factory Method Static Factory Method is a static method that compensates for the limits of constructors. 
This is the pattern Joshua Bloch organized in Effective Java, Item 1.\nConstructors carry four limits. They have no names. The same signature cannot define multiple constructors. They must return a new instance on every call. The caller has to know the exact return type. Static Factory Method addresses all four.\nNamed creation — BigInteger.probablePrime() tells the reader what is created. Constructor overloads cannot carry the same meaning. Caching — Integer.valueOf(int) caches frequently used values (-128 to 127) and returns the same instance. The base of the Flyweight pattern. Varied return types — Declare an interface as the return type and return an implementation. The caller of Collections.unmodifiableList() does not know the concrete class. Changing the implementation does not affect callers. The returned class need not exist when the factory method is written — JDBC\u0026rsquo;s DriverManager.getConnection() is the canonical case. Which Driver is loaded at call time decides what class of instance comes back. Even the Java standard library shows the pattern repeatedly — Optional.of, List.of, Map.of, Stream.of, Files.newBufferedReader. It is the variant most often encountered in practice.\nLimits exist. A class that offers only static factory methods, with no public or protected constructor, cannot be subclassed. And without an obvious name, the method becomes harder to discover than a constructor. Conventions like of, from, valueOf, getInstance, and newInstance exist to soften that.\nChoosing Among Them Following the threads of the three variants, the application conditions line up like this.\nMultiple concrete implementations behind the same signature — Factory Method (inheritance). When the domain is stable and the extension point is clear. Related objects that need to fit together — Abstract Factory (composition). UI families, DB driver families — products that require consistency together. 
Constructor limits get in the way (no names, no caching, concrete type exposure) — Static Factory Method (alternative). The most common choice in practice. The three are not mutually exclusive. A library often shows all three working in different places.\nRelation to DI Containers A DI container reads as a generalization of Factory.\nContainers handle both object creation and dependency injection. Configuration decides which implementation goes where, and the caller depends only on the abstraction (the interface). Where Abstract Factory takes care of family-consistent creation, the container takes the same role. Static Factory Method\u0026rsquo;s caching matches the effect of a container\u0026rsquo;s singleton scope.\nDoes that make explicit Factory disappear? For simple creation, the container suffices. But when the choice of concrete depends on domain logic at runtime, explicit Factory still reads more clearly. Picking a different PaymentProcessor per payment method, or a different DiscountPolicy per user tier — those are runtime decisions, not container configuration.\nThe DIP covered in the dependency-injection post is the abstraction that makes this separation possible, and Factory is one way to express that abstraction in code.\nConclusion Three variants gather under the name Factory, but the deciding axis differs. Factory Method extends through inheritance, Abstract Factory holds object families consistent, and Static Factory Method compensates for constructor limits. The same intent — separating creation from use — gets expressed three different ways.\nThe variant most often met in practice is Static Factory Method. From the Java standard library to domain code, the same pattern repeats. 
Factory Method and Abstract Factory get picked when the situation fits specific conditions — inheritability available, object family present.\nThe names look similar enough to be confused for one pattern, but the application decision separates by intent.\nReferences Singleton — Where Static Factory Method\u0026rsquo;s getInstance meets single-instance guarantee Dependency Injection — The Hierarchy of DIP, IoC, and DI — How a DI container generalizes Factory Joshua Bloch — Effective Java (3rd ed.), Item 1: Consider static factory methods instead of constructors GoF — Design Patterns: Elements of Reusable Object-Oriented Software (1994) ","permalink":"https://wid-blog.github.io/en/posts/tech/design-pattern/factory/","summary":"Factory\u0026rsquo;s shared intent is separating creation from use. The three variants — Factory Method, Abstract Factory, and Static Factory Method — split creation differently and suit different conditions. Static Factory Method is the variant most often encountered in practice, and DI containers absorb part of Factory\u0026rsquo;s explicit role.","title":"Factory"},{"content":"Singleton is one of the simplest GoF patterns and one of the first taught. At the same time, in practice it gets labeled as an anti-pattern more often than any other. The reason \u0026ldquo;fundamental\u0026rdquo; and \u0026ldquo;anti-pattern\u0026rdquo; land on the same pattern is not in the pattern itself. It is that Singleton bundles two intents into one.\nSingle-instance guarantee and global access. The moment the two are bundled, tight coupling and test difficulty follow. This is the same context the dependency-injection post covers when explaining why DI emerged.\nGoF Intent GoF defines Singleton as \u0026ldquo;ensure a class has only one instance and provide a global point of access to it.\u0026rdquo; Two guarantees, in one bundle.\nSingle-instance guarantee — A lifecycle decision. 
There is a domain fact that only one instance of the resource should exist within the process. Global access — A dependency-expression decision. How the client obtains the instance (constructor injection / method call / global reference). The two decisions are separate axes by nature. Singleton bundled them into one pattern to gain simplicity. That simplicity is also where the anti-pattern debate starts.\nLanguage Implementations Each language handles thread safety differently.\nJava has four common implementations.\nEager initialization — Created at class loading. Simplest, but holds memory even when unused. Lazy initialization — Created at first call. Requires manual thread safety. DCL (Double-Checked Locking) — volatile plus two-stage null check for both thread safety and lazy init. Breaks subtly if volatile is missing. Initialization-on-demand Holder — Uses an inner static class to leverage JVM\u0026rsquo;s class loading guarantees. Same effect as DCL without the subtlety. Python uses two common approaches.\nModule — A module loads exactly once at import, making it a natural Singleton. The most common approach. Metaclass — Overrides __call__ to control instance creation. Useful when class hierarchies are deep. TypeScript / JavaScript rely on the module system itself for single-instance guarantee. ESM\u0026rsquo;s module cache loads the same module exactly once, so exported objects naturally become Singletons. The class with static getInstance() pattern works too, but module export reads more naturally.\nOn thread safety, JVM-based languages (Java/Kotlin) are the trickiest; Python\u0026rsquo;s GIL and JavaScript\u0026rsquo;s single-threaded model remove much of the same concern.\nWhy It Becomes an Anti-Pattern Singleton gets labeled as an anti-pattern for four reasons.\nGlobal state — State accessible from anywhere is hard to trace. A change in module A affects what module B sees. The scope of tracking widens to the entire codebase. 
Test difficulty — Unit tests should run in isolation. A global instance is shared across tests, so changes from one test affect another. Mocking is also awkward — callers reference the concrete class directly, leaving no interface seam to swap. Tight coupling — A call like Logger.getInstance() means the caller knows Logger as a concrete class. Swapping the implementation (moving to a different logger library) touches every caller. No lifecycle control — A Singleton is created at the first call and lives until the process ends. The caller cannot decide when to release or recreate it. When the four work together, the codebase grows fragile to change over time. A change starting in one module can affect places that are hard to predict.\nContrast with DI DI containers separate Singleton\u0026rsquo;s two intents.\nThe single-instance guarantee becomes a scope setting on the container. Spring\u0026rsquo;s @Scope(\u0026quot;singleton\u0026quot;) or NestJS\u0026rsquo;s default provider scope provides the same effect. The container creates one instance and injects the same one wherever the dependency is needed.\nGlobal access is replaced by dependency injection. The client receives the dependency through a constructor parameter and does not know how it was obtained. Swapping implementations becomes a one-line container change, and tests inject mocks to create isolated environments.\nOnce the two intents are separated, the benefit of a single instance stays while the cost of global access disappears. DIP, covered in the dependency-injection post, is the abstraction that makes this separation possible.\nWhere It Still Fits DI being the general alternative does not make Singleton an anti-pattern in every case. Within a narrow range, it still fits.\nThread pools, connection pools — Resources that must exist exactly once per process. Multiple pools create resource contention. Loggers — The output channel needs consistent management. Multiple instances scramble output order or format. 
Configuration loaders — Read once at process startup and held in memory. Multiple instances duplicate loading or break consistency. Caches — Sharing across the process is the point. A separate cache per instance defeats the cache itself. The two common traits are: lifecycle equals process lifetime, and state is the essence. There is no reason to scale instances up or down, and the resource is meant to be shared. External injection adds complexity without benefit.\nEven in these cases, expressing it through a DI container\u0026rsquo;s singleton scope is more flexible. Only in environments without a container (simple scripts, CLI tools, embedded) does direct Singleton implementation cost less.\nConclusion Singleton itself is not the anti-pattern; the decision to bundle single-instance guarantee and global access into one pattern is. Separate the two intents and you keep the benefits of a single instance without the tight coupling and test difficulty. DI is the tool that generalizes that separation.\nThe choice comes down to two questions.\nIs a DI container available? If yes, expressing it through the container\u0026rsquo;s singleton scope separates the two intents and is the first choice. Is this one of those narrow cases where separation adds more cost? In environments without a container, when the lifecycle equals process lifetime and state is the essence, direct Singleton implementation reads naturally. The simplest pattern, with the most nuanced application decision.\nReferences Dependency Injection — The Hierarchy of DIP, IoC, and DI — Why DI is the alternative to Singleton Misko Hevery — Singletons are Pathological Liars GoF — Design Patterns: Elements of Reusable Object-Oriented Software (1994) ","permalink":"https://wid-blog.github.io/en/posts/tech/design-pattern/singleton/","summary":"Singleton is one of the simplest patterns, yet it sits at the center of the canonical anti-pattern debate. 
The decision to bundle single-instance guarantee with global access into one pattern causes tight coupling and test difficulty. DI is the general alternative that separates the two intents.","title":"Singleton"},{"content":"DI is used often and confused often. \u0026ldquo;DI = IoC,\u0026rdquo; \u0026ldquo;the DI Container is DIP,\u0026rdquo; \u0026ldquo;using Spring satisfies DIP\u0026rdquo; — these equivalences appear regularly in learning material and blog posts. The three are not in the same position; they sit at different levels of abstraction, and without seeing that hierarchy clearly it is easy to confuse framework features with design principles.\nDIP is a principle — \u0026ldquo;high-level modules do not depend on low-level modules.\u0026rdquo; IoC is a pattern — control flow is handed to something outside the caller. DI is a technique — dependencies are received from outside. Below them sits the DI Container, a tool. The principle is the most abstract; the pattern implements the principle; the technique implements the pattern; the tool automates the technique.\nflowchart TB A[DIPDesign Principle] --\u003e B[IoCControl Pattern] B --\u003e C[DIInjection Technique] C --\u003e D[DI ContainerAutomation Tool] DIP The direction of dependencies between modules is what DIP addresses. It is one of the five SOLID principles, and the definition reduces to two lines.\nHigh-level modules do not depend on low-level modules. Both depend on abstractions. Abstractions do not depend on details. Details depend on abstractions. The classic violation is business logic depending directly on data-access technology.\nclass OrderService { private final MySQLPaymentRepository repository = new MySQLPaymentRepository(); public void charge(Order order) { repository.save(order.payment()); } } OrderService knows about a concrete class, MySQLPaymentRepository. Switching the DB to PostgreSQL or swapping in a mock for tests requires editing OrderService. 
The dependency direction flows from high-level (business) to low-level (data access).\nApplying DIP reverses the direction by placing an abstraction between them.\ninterface PaymentRepository { void save(Payment payment); } class OrderService { private final PaymentRepository repository; public OrderService(PaymentRepository repository) { this.repository = repository; } public void charge(Order order) { repository.save(order.payment()); } } class MySQLPaymentRepository implements PaymentRepository { /* ... */ } On the surface this only added an interface, but the core point is who owns the interface. PaymentRepository is defined by the high-level module (the business layer). The low-level module (the infrastructure layer) implements that interface. The direction flips — the low-level depends on the abstraction owned by the high-level. That is the inversion.\nAdding an interface alone does not satisfy DIP if the infrastructure layer owns the interface. The dependency direction remains unchanged. DIP is a principle about the location of the interface, not its existence.\nIoC \u0026ldquo;Who calls whom\u0026rdquo; is the control direction IoC addresses. In typical code I call library functions. With IoC applied, the framework calls my code. The Hollywood Principle — \u0026ldquo;Don\u0026rsquo;t call us, we\u0026rsquo;ll call you\u0026rdquo; — is the often-cited slogan.\nIoC is one way to satisfy DIP. When control flows from framework to my code, my code does not construct its own dependencies. The framework constructs and hands them over. Depending on abstractions becomes natural.\nIoC has more than one implementation.\nDependency Injection. Dependencies are received from outside. Service Locator. Dependencies are fetched from a central registry. Template Method. A parent class defines the flow; subclasses fill in specific steps. Event-driven. Registered handlers run when an event occurs. DI is the most explicit and most frequently used of these. 
That is why \u0026ldquo;IoC = DI\u0026rdquo; is a common equivalence, but it is not accurate.\nDI Objects receiving their dependencies from outside instead of constructing them is the DI technique. Injection style varies by where the dependency is received.\nConstructor Injection. Dependencies are received as constructor parameters. Natural when dependencies are required and immutable. The most commonly recommended form. Setter Injection. Dependencies are received through setter methods. Used when dependencies are optional or may change at runtime. The trade-off is that the object can briefly exist without its dependencies. Interface Injection. A separate interface is defined to receive the dependency. Rarely used in practice. Constructor Injection is recommended because it expresses immutability and required-ness. A constructor-injected dependency can be declared final, and the object cannot exist without its dependencies. Setter injection guarantees neither.\nDI is the concrete implementation of IoC and, at the same time, a natural path to satisfying DIP. When the constructor receives an abstraction (interface), the object depends only on the abstraction without knowing the concrete implementation.\nDI Container Hand-wiring accumulates boilerplate as the dependency graph grows. The DI Container automates that wiring.\n// DI by hand PaymentRepository repository = new MySQLPaymentRepository(); OrderService orderService = new OrderService(repository); PaymentController controller = new PaymentController(orderService); // DI Container automating @Service class OrderService { public OrderService(PaymentRepository repository) { /* ... */ } } The automation is misread in two directions.\nDIP, IoC, and DI are all satisfied without a DI Container. Hand-wiring code in which the high-level module owns the interface and dependencies are injected via the constructor satisfies all three. 
In small systems, hand-wiring is enough.\nAdopting a DI Container alone does not satisfy DIP. Even with @Autowired annotations everywhere, if the interface partitioning is wrong — for instance, the business layer depending on an interface named Repository owned by the infrastructure layer — the dependency direction still flows from high-level to low-level. The DI Container is a tool, not a designer.\nThe Hierarchy Level Name Identity Example Principle DIP Design principle \u0026ldquo;High-level does not depend on low-level\u0026rdquo; Pattern IoC Control flow \u0026ldquo;The framework calls my code\u0026rdquo; Technique DI Dependency delivery \u0026ldquo;Inject through the constructor\u0026rdquo; Tool DI Container Automation Spring @Autowired, NestJS @Injectable Correcting Common Equivalences Seeing the hierarchy clearly exposes problems in equivalences that appear frequently.\n\u0026ldquo;DI = IoC.\u0026rdquo; DI is one implementation of IoC. Service Locator, Template Method, and Event-driven also satisfy IoC. Treating the two as identical traps thinking in the surface trait \u0026ldquo;the framework constructs objects\u0026rdquo; and hides the other shapes IoC can take.\n\u0026ldquo;Using a DI Container satisfies DIP.\u0026rdquo; When the interface\u0026rsquo;s location is wrong, the DI Container is meaningless. If the business layer depends on an interface owned by the infrastructure layer, automatic injection leaves the dependency direction unchanged. DIP is a principle about structure, not about wiring method.\n\u0026ldquo;Using Spring automatically gives DIP.\u0026rdquo; Frameworks make it easy to satisfy DIP, but they do not satisfy it automatically. Interface partitioning, module boundaries, and dependency direction remain the designer\u0026rsquo;s responsibility. The framework only provides a convenient environment for expressing those decisions in code.\nPrinciple, pattern, technique, and tool sit at different levels. 
Flattening them invites the complacency of \u0026ldquo;the framework took care of it.\u0026rdquo; DIP is the result of design, not the result of tools. Tools only help make that result easier to produce.\nReferences Nest.js Fundamentals — DI and Module System — How Nest.js\u0026rsquo;s DI Container and Module system implement this hierarchy. ","permalink":"https://wid-blog.github.io/en/posts/tech/design-pattern/dependency-injection/","summary":"DIP (principle), IoC (pattern), and DI (technique) sit at different levels of abstraction. The hierarchy must be clear before framework features and design principles can be told apart.","title":"Dependency Injection — The Hierarchy of DIP, IoC, and DI"},{"content":"Multiple services share a cache and periodically fetch configuration data. With full refresh, the entire dataset is transmitted every cycle regardless of whether anything changed. The less frequently the data changes, the greater the waste.\nRefreshing only changed items reduces network throughput to be proportional to the actual change rate.\nFull Refresh Full refresh is simple to implement. Every cycle, fetch all data and replace local state. No change detection logic needed.\nBut network throughput under this approach is data size × consumer count × refresh frequency. Whether the data actually changed is irrelevant. As consumers or data size grows, throughput scales linearly.\nA situation can arise where CPU and memory have headroom but network throughput hits the limit. Scaling up the instance to resolve this wastes CPU and memory capacity.\nData Separation Before applying incremental refresh, data must be separated by update frequency.\nConfiguration data. Entity metadata, conditions, and rules change only when an administrator modifies them. Update frequency is low.\nReal-time data. Counters and consumption metrics update with every request. They must always reflect the latest state. 
These are not candidates for incremental refresh.\nApply incremental refresh only to configuration data. Real-time data continues refreshing every cycle.\nDeduplication When an entity contains sub-entities that are also referenced by other entities, duplication can occur. Managing sub-entities separately reduces both storage and transmission volume.\nChange Detection Strategies Data Comparison A batch job fetches data from the source (database) and directly compares it against what is stored in the cache. Only items with different content are written.\nThe advantage is accuracy. Source-cache mismatches are never missed. The disadvantage is the additional read cost of fetching existing cache data for comparison.\nTimestamp-Based Record the change time for each item. Storing timestamps as Sorted Set scores enables range queries for items changed after a specific point in time.\nflowchart LR subgraph Write [\"Write\"] BATCH[\"Batch\"] --\u003e UPD[\"Update changed items\"] UPD --\u003e TS[\"Record timestampin Sorted Set\"] end subgraph Read [\"Read\"] SVC[\"Service\"] --\u003e RANGE[\"Range query:changes since last refresh\"] RANGE --\u003e CHANGED[\"Changed item IDs\"] CHANGED --\u003e GET[\"Fetch those items only\"] end When the reader remembers its last query time, it can fetch only items changed since then. Full scans become range queries, and throughput scales with the number of changes rather than total data size.\nBoth strategies can be combined. The write path uses data comparison to detect changes and records timestamps. The read path uses timestamp range queries to fetch changes.\nWrite Path / Read Path Clear separation of write and read paths is essential in incremental refresh.\nThe write path is handled by a batch job. It fetches data from the source, compares against the cache, writes only changes, and records change timestamps.\nThe read path is handled by services. They query only items changed since their last refresh and partially update local state. 
When nothing has changed, local data is retained as-is.\nDifferent consumers may need different data scopes. Some need only configuration. Others need configuration plus content. Others need real-time data as well. Separating read interfaces per consumer lets each service fetch only what it needs.\nUse Cases This pattern applies frequently to architectures where configuration data is periodically fetched from a shared cache.\nAd campaign configuration. Campaign metadata and targeting conditions change infrequently but are queried simultaneously by multiple servers. Switching from full to incremental refresh significantly reduces network throughput.\nProduct catalogs. Product information changes only on creation or modification. Refreshing only changed products instead of transmitting thousands every cycle is more efficient.\nUser permissions/settings. Permission changes are infrequent but referenced by many services. Incremental refresh fits this structure well.\nTrade-offs Incremental refresh adds complexity compared to full refresh.\nChange detection logic, timestamp management, and partial local state updates are all required. A mechanism for full synchronization to recover from cache-source inconsistencies should also be considered.\nWhen data is small or changes frequently, full refresh is simpler and sufficient. Switching to incremental refresh is appropriate when network cost has become the actual bottleneck.\n","permalink":"https://wid-blog.github.io/en/posts/tech/architecture/incremental-cache-refresh/","summary":"A pattern for switching from full cache refresh to incremental refresh. Separating data by update frequency and applying change detection reduces network costs.","title":"Incremental Cache Refresh Pattern"},{"content":"The cache had plenty of CPU and memory headroom. Network throughput was the bottleneck.\nMultiple ad servers periodically fetched campaign configuration from the cache. Campaign settings rarely changed. 
But the system pulled the entire dataset every cycle regardless of whether anything had been modified. As the number of servers grew, network throughput approached the instance\u0026rsquo;s baseline limit, and downscaling was off the table.\nData Separation Looking at the cached campaign data, I found three different types bundled together.\nMetadata. Campaign metadata and targeting conditions change infrequently. They only update when an advertiser modifies a campaign.\nState data. Budget consumption updates with every ad impression. It must always reflect the latest state.\nShared data. Ad creatives can be shared across multiple campaigns. Including them within campaign data creates duplication.\nI separated all three. Metadata and shared data switched to incremental refresh. State data continued refreshing every cycle.\nIncremental Refresh Switching from full refresh to incremental refresh requires knowing what has changed.\nA batch job fetches the latest data from the database, then compares it against what is stored in the cache. Only items with different content are written to the cache. Change timestamps are recorded in a dedicated change index. On the read side, services fetch only items changed since their last refresh.\nflowchart LR subgraph Write [\"Write Path\"] DB[\"DB\"] --\u003e BATCH[\"Batch\"] BATCH --\u003e CMP{\"Comparewith cache\"} CMP --\u003e|\"Changed\"| WRITE[\"Update cache+ record timestamp\"] CMP --\u003e|\"Same\"| SKIP[\"Skip\"] end subgraph Read [\"Read Path\"] SVC[\"Service\"] --\u003e TS{\"Changes sincelast refresh?\"} TS --\u003e|\"Yes\"| FETCH[\"Fetch changes only\"] TS --\u003e|\"No\"| LOCAL[\"Keep local data\"] end Separating write and read paths was the key. The batch writes only changes. Services read only changes. The detailed principles of this pattern are documented separately.\nResult Network throughput dropped significantly. During cycles with no changes, almost no data was transmitted. 
The cache instance could be downscaled to a smaller type.\nLooking back, the starting point of this work was accurately identifying the bottleneck. Confirming that the constraint was network, not CPU or memory, naturally led to data separation and incremental refresh as the direction.\nReference Incremental Cache Refresh Pattern ","permalink":"https://wid-blog.github.io/en/posts/career/dable/ad-campaign-cache-optimization/","summary":"How I reduced network costs and enabled instance downscaling by switching from full cache refresh to incremental refresh for campaign configuration data.","title":"Cache Refresh Optimization Retrospective"},{"content":"Transactions are a daily reality in backend development. Order creation, payment processing, inventory deduction — operations grouped into a single \u0026ldquo;all succeed or all fail\u0026rdquo; unit. The four letters of ACID name those guarantees, but what each letter actually guarantees is often understood only at the surface.\nC and I in particular are frequently misread. C gets simplified into \u0026ldquo;the DB handles consistency,\u0026rdquo; and I gets treated as a simple on/off. Each of the four properties has a line between what it guarantees and what it does not — that line is what this post draws.\nA (Atomicity) Either every operation in a transaction is applied, or none is. If something fails mid-way, changes up to that point are fully rolled back.\nTake an order creation transaction: insert an order row, decrement inventory, record payment — three operations grouped in one transaction. If the payment record step errors out, the order row and inventory change are both undone. No partial state like \u0026ldquo;the order went in but payment didn\u0026rsquo;t\u0026rdquo; exists.\nDBs typically implement this with an undo log. Each change records the prior value separately, and a rollback restores from that log. 
Once commit completes, the undo log is no longer needed for rollback, although MVCC engines such as InnoDB retain it until no open snapshot needs the old versions.\nThe atomicity discussed here is limited to a single DB. Atomicity across distributed systems (multiple DBs or external services) requires separate mechanisms like 2PC or saga and is outside this series.\nC (Consistency) The most frequently misunderstood property. \u0026ldquo;The DB handles consistency\u0026rdquo; is only half correct.\nWhat C guarantees is the DB constraint layer: primary key uniqueness, foreign key referential integrity, check constraints, NOT NULL, unique indexes. If any of these is violated, the DB rejects the offending statement, or the commit itself when the constraint is deferred. This part is automatic.\nApplication invariants are a different story. \u0026ldquo;The sum of order amounts must equal the payment amount,\u0026rdquo; \u0026ldquo;inventory cannot go negative,\u0026rdquo; \u0026ldquo;a refund requires the original transaction to be in \u0026lsquo;paid\u0026rsquo; state\u0026rdquo; are business rules that often cannot be expressed as DB constraints, or can be expressed only partially.\nPreventing negative inventory works with CHECK (stock \u0026gt;= 0), but rules like \u0026ldquo;is this refundable?\u0026rdquo; that combine multiple rows and states fall to application code. How these rules are validated within a transaction, and in what order, is the application\u0026rsquo;s responsibility.\nThe \u0026lsquo;C\u0026rsquo; in ACID means transition to a consistent state. What \u0026ldquo;consistent\u0026rdquo; means is defined jointly by DB constraints and application invariants, and the DB enforces only the constraint portion. Without drawing this line clearly, bugs of the form \u0026ldquo;I thought the DB was handling it\u0026rdquo; emerge.\nI (Isolation) I controls what concurrent transactions can see of each other. 
When transaction T1 is mid-execution and T2 touches the same data, how much of each other\u0026rsquo;s intermediate state is visible.\nPerfect isolation — every transaction behaving as if executed serially — is the strongest correctness guarantee available. But that guarantee nearly kills concurrency. If only one transaction can run at a time, throughput collapses.\nSo RDBs offer isolation in levels, not as a single on/off switch. The choice is \u0026ldquo;which anomalies to permit.\u0026rdquo; Stronger isolation permits fewer anomalies but costs more concurrency.\nThe point of I is why those levels exist. Correctness and concurrency are in direct tension, and the next post in this series takes those four levels apart along with the anomalies each one blocks.\nD (Durability) Once a transaction commits, its result survives system failure. Power loss, OS crash, process kill — committed data is not lost.\nThe typical implementation is a Write-Ahead Log (WAL). Changes are written to a separate log file and fsync\u0026rsquo;d to disk before the main data file. After restart, replaying the log restores the committed state.\nD guarantees durability at commit time. Failures before commit mean the transaction was never applied, which connects naturally to A. Together, the two guarantees leave only two extremes: \u0026ldquo;definitely complete\u0026rdquo; or \u0026ldquo;as if never happened.\u0026rdquo;\nThe internals of WAL — checkpoints, log recycling, recovery algorithms — are a substantial topic in their own right, and this series stops at the conceptual level of what D guarantees.\nSummary A and D are relatively simple guarantees. All applied or none applied (A). Permanent after commit (D). The implementations are complex, but the guarantees themselves are clear.\nC has shared responsibility. DB constraints cover one side, application invariants cover the other. 
Missing this line leads to bugs where rules thought to be guaranteed by the transaction turn out not to be.\nI is a policy choice. Perfect isolation is expensive, weaker isolation permits anomalies. The reason four levels exist — the tension between correctness and concurrency — is the through-line of this whole series.\nThe next post takes those four levels and shows which anomalies each one blocks and which it allows.\n","permalink":"https://wid-blog.github.io/en/posts/tech/database/rdb-transaction-acid/","summary":"What each of the four ACID properties actually guarantees in an RDB transaction. A/C/D are relatively clear guarantees, but only I has \u0026rsquo;levels\u0026rsquo; — the gateway to the correctness vs. concurrency trade-off.","title":"What RDB Transaction ACID Actually Guarantees"},{"content":"Ad budget pacing distributes a campaign\u0026rsquo;s daily budget evenly across its eligible time window. Advertisers choose it when they want stable spend across the day rather than burning through the budget early.\nThe original approach applied the same rule to every campaign. Regardless of campaign characteristics, each one started from the same line, and the exposure probability was adjusted uniformly based on cumulative spend. It worked on average, but it broke wherever campaign characteristics varied sharply. Some campaigns burned through the budget early; others hit the initial cap and never fully spent.\nA single rule applied uniformly was the limit. I redesigned this feature as a two-layer control loop — per-campaign learning and real-time correction.\nLimits of a Single Rule The flaw was clear. Every campaign started from the same line. Campaign characteristics live as a distribution, not an average, and they shift by interval and by environment. 
A single rule that assumes the average breaks at the ends of that distribution — neither the fast spenders nor the under-spenders are absorbed by the same rule.\nApplying the same probability to every campaign means ignoring what each one actually is.\nTwo-Layer Structure Solving this inside a single layer ran into a contradiction. React quickly and exposure becomes jittery; react slowly and budget misses pile up. Frequent probability swings looked unstable to advertisers, while only large-scale adjustments could not absorb the in-between traffic variance.\nSo I split it into two layers.\nSlow controller. Analyzes a campaign\u0026rsquo;s recent spend pattern over a longer interval and derives a baseline for the next one.\nFast controller. Compares expected and actual spend at a shorter interval and absorbs the residual drift.\nThe slow controller hypothesizes the pace each campaign should follow. The fast controller checks whether the actual pace strays from that hypothesis. The two controllers\u0026rsquo; time scales divided the responsibility naturally.\nBaseline and correction, separated The point is correction magnitude. If the slow controller fails to set the baseline, the fast controller has to swing the probability hard at every step. The more accurate the baseline, the smaller the correction. Small corrections do not hurt exposure stability while still absorbing short-term shocks.\nThe baseline itself is measurement-driven — the pace observed in the previous interval becomes the seed for the next. A simple measure-apply-measure loop.\nA familiar trap follows. When the measurement is missing. Some campaigns had too little exposure in the previous interval to register a measurement at all. Without one, the correction has to come from outside as a default seed — the same shape as cold-start in general control systems, not a quirk of ad pacing.\nAfter Comparing the same campaigns over the same intervals before and after, three changes stood out. 
Initial spend concentration eased, the exposure spike at the top of each interval subsided, and under-spending campaigns saw their fill rate rise. The three changes are different faces of the same cause — moving from a fixed rule to per-campaign learning.\nRetrospective The biggest decision in this work was the split itself. Doing learning and correction inside a single layer would have made the probability jitter at every step; doing only large-scale adjustments would have missed the short-term shocks. Separating two signals with different time scales into two controllers turned out to be the natural division of responsibility.\nIt was interesting to see how cleanly control-loop language fits the ad domain. Problems familiar from control engineering — cold-start, integrator windup, missing measurements — appeared in ad pacing in the same shape. Where the same problem keeps reappearing in different domains, borrowing the established vocabulary keeps the thinking stable.\nNext time something with a similar grain shows up — measurement signals running at two or more time scales — splitting into layers from the start is where I would begin.\n","permalink":"https://wid-blog.github.io/en/posts/career/dable/balanced-pacing-control/","summary":"A retrospective on moving ad budget pacing from a fixed-rule scheme to a two-layer control loop — per-campaign learning sets the baseline, real-time correction absorbs drift.","title":"Two-Layer Control Loop for Ad Budget Pacing — Retrospective"},{"content":"I wanted hands-on experience with a low-level language. Managing memory through language rules, not a runtime. I chose Rust and followed the Rust Book Chapter 20 — a multithreaded HTTP server. About 200 lines, using only the standard library with no external crates.\nMemory Management Rust has no garbage collector. Instead, the ownership system determines when memory is freed at compile time.\nEvery value has exactly one owner. 
When the owner goes out of scope, the value is automatically dropped. Assigning a value to another variable moves ownership, and the original variable becomes unusable. The compiler enforces this.\nlet s1 = String::from(\u0026#34;hello\u0026#34;); let s2 = s1; // ownership moves // s1 is no longer usable — compile error Values can be borrowed without transferring ownership. Through references (\u0026amp;). Multiple immutable references can coexist, but only one mutable reference (\u0026amp;mut) is allowed at a time. This rule prevents data races at compile time.\nIn GC-based languages, the runtime reclaims memory. Rust delegates that decision to the compiler. Memory safety with zero runtime cost.\nThread Pool The core of the server is its thread pool. When a TCP connection arrives, work is distributed to worker threads.\nWork distribution uses channels. A single sender dispatches jobs, and multiple workers share the receiver. The problem was that Rust\u0026rsquo;s Receiver does not implement Clone. Sharing one receiver across multiple threads required a different approach.\nArc\u0026lt;Mutex\u0026lt;Receiver\u0026lt;T\u0026gt;\u0026gt;\u0026gt; was the answer. Arc enables multiple threads to own the same value through reference counting. Mutex ensures only one thread accesses the receiver at a time. This was where ownership rules extended naturally from single-threaded to concurrent contexts.\nlet (sender, receiver) = mpsc::channel(); let receiver = Arc::new(Mutex::new(receiver)); for id in 0..size { let receiver = Arc::clone(\u0026amp;receiver); // each worker shares the receiver\u0026#39;s reference count } Arc::clone() does not copy the value. It only increments the reference count. The type system makes this distinction explicit.\nGraceful Shutdown The Drop trait is called automatically when a value goes out of scope. I used it to clean up workers when the pool is destroyed.\nOrdering mattered. First, send a Terminate message to every worker. 
Then join each thread. Reversing this order risks deadlock: the pool blocks on the first worker\u0026rsquo;s join while the remaining workers never receive the shutdown signal.\n// 1. send termination signals first for _ in \u0026amp;self.workers { self.sender.send(Message::Terminate).unwrap(); } // 2. then join for worker in \u0026amp;mut self.workers { if let Some(thread) = worker.thread.take() { thread.join().unwrap(); } } The ? operator is unavailable here because drop() returns (), so both calls unwrap their results. Declaring worker.thread as Option\u0026lt;JoinHandle\u0026lt;()\u0026gt;\u0026gt; was an idiomatic Rust pattern. take() extracts the handle, leaving None in its place. This prevents double-joining the same thread at the type level.\nTrait Bounds The thread pool\u0026rsquo;s generic type has three constraints.\nPool\u0026lt;T: FnOnce() + Send + \u0026#39;static\u0026gt; FnOnce means the closure can be called at most once. A job runs once on one worker and that is it. Send guarantees the closure can be safely transferred to another thread. 'static requires that the closure hold no references to data that could be dropped while the thread is still running; values moved into the closure are fine. Since a thread\u0026rsquo;s lifetime is unpredictable, this prevents the thread from using borrowed data after that data has been freed.\nRemove any one of these three and the code will not compile. In Go, passing a closure to a goroutine has no such constraints. Race conditions are detected at runtime with the -race flag instead. Rust moves that verification to the compiler.\nRetrospective This project started from wanting to experience a low-level language. What I actually experienced was \u0026ldquo;safety enforced by the compiler\u0026rdquo; more than \u0026ldquo;low-level.\u0026rdquo;\nWithout composing Arc\u0026lt;Mutex\u0026lt;T\u0026gt;\u0026gt;, multiple threads cannot share a receiver. Without specifying FnOnce + Send + 'static, a closure cannot be sent to a thread. Without declaring Option\u0026lt;JoinHandle\u0026gt;, take() is unavailable. 
The compiler explains through error messages why each combination is necessary, and resolving them guarantees concurrency safety.\nWhat I ended up learning was the experience of a type system catching concurrency bugs before runtime.\nReferences rust-server GitHub Repository ","permalink":"https://wid-blog.github.io/en/posts/career/personal/rust-server-retrospective/","summary":"A record of implementing the multithreaded HTTP server from Rust Book Chapter 20, experiencing how ownership and concurrency safety are enforced at the type level.","title":"rust-server"},{"content":"Backend server development often uses TCP and UDP without much thought. HTTP APIs and WebSocket run on TCP. Voice/video streaming and DNS use UDP. When workloads with different reliability and performance requirements — chat servers, ad servers — enter the picture, understanding the transport layer becomes necessary.\nTransport Layer The transport layer handles data delivery between applications. While IP finds the route to a host, the transport layer determines which process on that host receives the data. Port numbers serve this purpose.\nBoth TCP and UDP run on top of IP. The difference is the choice between reliability and speed.\nTCP TCP (Transmission Control Protocol) is a connection-oriented protocol. It establishes a connection before sending data and retransmits on loss.\nConnection Establishment TCP establishes connections through a 3-way handshake.\nsequenceDiagram participant C as Client participant S as Server C-\u003e\u003eS: SYN (seq=x) Note right of S: SYN received, preparing S-\u003e\u003eC: SYN-ACK (seq=y, ack=x+1) Note left of C: SYN-ACK received C-\u003e\u003eS: ACK (ack=y+1) Note over C,S: Connection established The client sends a SYN packet along with its sequence number. The server responds with SYN-ACK, providing the server-side sequence number. The client sends ACK to complete the connection. 
The exchanged sequence numbers serve as reference points for tracking order in subsequent data transfer.\nConnection Termination Connection termination uses a 4-way handshake. TCP operates in full-duplex mode, so each direction must be closed separately.\nsequenceDiagram participant A as Initiator participant B as Peer A-\u003e\u003eB: FIN Note right of B: FIN received, inbound closed B-\u003e\u003eA: ACK Note over B: Finishes sending remaining data B-\u003e\u003eA: FIN Note left of A: FIN received, inbound closed A-\u003e\u003eB: ACK Note over A: Enters TIME_WAIT state Note over A,B: Connection released When one side sends FIN, it signals \u0026ldquo;no more data to send.\u0026rdquo; The peer responds with ACK, finishes sending its remaining data, then sends its own FIN. The initiator enters TIME_WAIT after sending the final ACK, allowing time for delayed packets that may still be in the network.\nSegmentation Data sent by the application is split into segments by TCP. A single send() call transmitting 4KB (4096 bytes) gets divided into multiple segments sized to the MSS, 1460 bytes in this example.\nflowchart LR A[\"Application\\n4KB payload\"] --\u003e B[\"TCP\"] B --\u003e C[\"Segment 1\\nseq=1\\n1460 bytes\"] B --\u003e D[\"Segment 2\\nseq=1461\\n1460 bytes\"] B --\u003e E[\"Segment 3\\nseq=2921\\n1176 bytes\"] The receiving TCP reassembles arriving segments in sequence number order and delivers them to the application. Segmentation and reassembly happen transparently: the application sees only the original continuous byte stream.\nReliability Guarantees Segments can be lost or arrive out of order as they traverse the network. TCP guarantees data delivery through two mechanisms.\nOrdering: Each segment carries a sequence number. The receiver reassembles data in original order using these numbers. Even when segments arrive out of order, the application receives sorted data.\nRetransmission: After sending data, the sender waits for an ACK. 
If no ACK arrives within the RTO, Retransmission Timeout, it resends the data.\nACK numbers use cumulative acknowledgment. \u0026ldquo;ACK 3\u0026rdquo; means \u0026ldquo;everything before 3 has been received; expecting 3 next.\u0026rdquo;\nNormal Flow sequenceDiagram participant S as Sender participant R as Receiver S-\u003e\u003eR: Segment 1 R-\u003e\u003eS: ACK 2 S-\u003e\u003eR: Segment 2 R-\u003e\u003eS: ACK 3 When segments arrive in order, the receiver sends an ACK with the next expected number. ACK 2 means \u0026ldquo;received 1, expecting 2 next.\u0026rdquo;\nLoss Scenario sequenceDiagram participant S as Sender participant R as Receiver S-\u003e\u003eR: Segment 1, 2 S--xR: Segment 3 [lost] S-\u003e\u003eR: Segment 4 R-\u003e\u003eS: ACK 3 R-\u003e\u003eS: ACK 3 (dup) R-\u003e\u003eS: ACK 3 (dup) Note over S: 3 duplicates → loss detected S-\u003e\u003eR: Segment 3 [retransmitted] R-\u003e\u003eS: ACK 5 The receiver cannot advance the ACK number past 3 because Segment 3 is missing, even after Segment 4 arrives. It repeats ACK 3. When the sender sees 3 duplicate ACKs, it retransmits immediately without waiting for a timeout. Once the receiver gets the retransmitted Segment 3, it combines it with the buffered Segment 4 and sends ACK 5.\nFlow Control Exceeding the receiver\u0026rsquo;s processing capacity causes data loss. TCP prevents this with the sliding window mechanism.\nThe receiver advertises its available buffer size as the receive window, rwnd. The sender does not send more unacknowledged data than the rwnd allows.\nAwaiting ACKs:\nblock-beta columns 10 block:window[\"Send Window (rwnd = 4)\"]:4 s3[\"3 sent\"] s4[\"4 sent\"] s5[\"5 ready\"] s6[\"6 ready\"] end s7[\"7\"] s8[\"8\"] s9[\"9\"] s10[\"10\"] s11[\"11\"] s12[\"12\"] style s3 fill:#4CAF50 style s4 fill:#4CAF50 style s5 fill:#42A5F5 style s6 fill:#42A5F5 Segments 3 and 4 have been sent. Segments 5 and 6 are inside the window but not yet transmitted. 
Segment 7 onward is outside the window and cannot be sent.\nAfter receiving ACK 4 — window slides right:\nblock-beta columns 10 s3[\"3 ✓\"] block:window[\"Send Window (rwnd = 4)\"]:4 s4[\"4 sent\"] s5[\"5 sent\"] s6[\"6 ready\"] s7[\"7 ready\"] end s8[\"8\"] s9[\"9\"] s10[\"10\"] s11[\"11\"] s12[\"12\"] style s3 fill:#9E9E9E style s4 fill:#4CAF50 style s5 fill:#4CAF50 style s6 fill:#42A5F5 style s7 fill:#42A5F5 When ACK 4 returns, acknowledging Segment 3, the window shifts one position right. Segment 3 moves out of the window as complete. Segment 7 enters the window. This repeats with each ACK. When the receiver\u0026rsquo;s buffer fills up, it sends rwnd=0 to pause transmission. When buffer space opens, it advertises a new window size to resume.\nThe sliding window overcomes the limitations of Stop-and-Wait, where only one packet can be in transit at a time. With sliding windows, multiple packets within the window range transmit continuously while awaiting ACKs.\nTwo retransmission strategies exist. Go-Back-N retransmits all packets from the lost one onward. Simple to implement but causes unnecessary retransmissions. Selective Repeat retransmits only the lost packets. Requires receiver-side buffering but improves network efficiency. TCP\u0026rsquo;s cumulative ACKs alone do not tell the sender which later segments have arrived. The SACK option adds that information, letting TCP retransmit only the missing segments, much like Selective Repeat.\nCongestion Control While flow control protects the receiver\u0026rsquo;s capacity, congestion control protects network path capacity. When the network is congested, router buffers overflow and packets are lost.\nTCP manages a variable called the congestion window, cwnd. The amount of unacknowledged data the sender may have in flight is limited by the smaller of rwnd and cwnd.\nSlow Start At connection start, network capacity is unknown. cwnd begins at 1 MSS and increases by 1 MSS for each ACK received. 
cwnd doubles each RTT — exponential growth.\n--- config: xyChart: xAxis: label: \"RTT\" yAxis: label: \"cwnd (MSS)\" --- xychart-beta x-axis [\"0\", \"1\", \"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\", \"10\"] y-axis 0 --\u003e 40 line \"cwnd\" [1, 2, 4, 8, 16, 17, 18, 19, 20, 10, 11] When cwnd reaches ssthresh (slow start threshold), TCP switches to congestion avoidance. In the graph above, ssthresh (=16) is reached at RTT 4, after which growth becomes linear. At RTT 9, packet loss is detected and cwnd drops to half.\nCongestion Avoidance After ssthresh, cwnd increases by only 1 MSS per RTT — linear growth. Transmission volume increases gradually until packet loss is detected. This is the AIMD strategy — Additive Increase, Multiplicative Decrease. It probes available bandwidth gradually without overloading the network.\nFast Retransmit Two ways to detect packet loss: timeout (RTO expiry) and duplicate ACKs. Timeouts can take hundreds of milliseconds to seconds.\nFast retransmit triggers immediate retransmission when 3 duplicate ACKs arrive, without waiting for timeout. Duplicate ACKs themselves signal that \u0026ldquo;data after the lost packet is arriving, but something in between is missing.\u0026rdquo;\nFast Recovery Returning to slow start after fast retransmit causes a sharp throughput drop. Fast recovery prevents this. Instead of dropping cwnd to 1, it halves cwnd and resumes directly from congestion avoidance.\nArriving duplicate ACKs indicate the network is not completely blocked — some packets are getting through. 
No need to retreat all the way to slow start.\nflowchart TD A[Connection start] --\u003e B[Slow Start: cwnd exponential growth] B --\u003e|cwnd \u003e= ssthresh| C[Congestion Avoidance: cwnd linear growth] B --\u003e|Timeout| D[ssthresh = cwnd/2, cwnd = 1 MSS] C --\u003e|3 dup ACKs| E[Fast Retransmit + Fast Recovery: ssthresh = cwnd/2, cwnd = ssthresh] C --\u003e|Timeout| D D --\u003e B E --\u003e C Timeout signals severe congestion — cwnd resets to 1 MSS. Loss detected via duplicate ACKs indicates milder congestion — cwnd halves.\nImplementation Variants The four core algorithms share the same skeleton, but specific behavior varies by implementation.\nTCP Reno: The first implementation to integrate slow start, congestion avoidance, fast retransmit, and fast recovery. Handles one packet loss per window efficiently.\nTCP NewReno: Addresses Reno\u0026rsquo;s limitation. When multiple packets are lost within a single window, the ACK that follows a retransmission covers only part of the outstanding data (a partial ACK). NewReno treats a partial ACK as a sign of another loss, stays in fast recovery, and retransmits the next missing packet rather than falling back to slow start.\nTCP CUBIC: The default congestion control algorithm in Linux. Uses a cubic function instead of RTT-proportional linear increase during congestion avoidance. Utilizes available bandwidth faster on high-bandwidth, long-distance networks.\nBBR, Bottleneck Bandwidth and RTT: A model-based algorithm developed by Google. Determines transmission rate based on measured bandwidth and RTT rather than packet loss. Estimates the actual bottleneck bandwidth of the network path and aims for maximum throughput without excessively filling router buffers. A substantial share of Internet traffic is reported to use BBR.\nUDP UDP (User Datagram Protocol) is a connectionless protocol. It transmits data immediately without a handshake.\nStructure The UDP header is 8 bytes: source port, destination port, length, and checksum. 
Compared to TCP\u0026rsquo;s minimum 20-byte header, the overhead is minimal.\nblock-beta columns 4 block:tcp[\"TCP Header (20+ bytes)\"]:4 t1[\"Source Port\"] t2[\"Dest Port\"] t3[\"Sequence Number\"] t4[\"ACK Number\"] t5[\"Flags\"] t6[\"Window Size\"] t7[\"Checksum\"] t8[\"Options...\"] end space:4 block:udp[\"UDP Header (8 bytes)\"]:4 u1[\"Source Port\"] u2[\"Dest Port\"] u3[\"Length\"] u4[\"Checksum\"] end style tcp fill:#E3F2FD style udp fill:#E8F5E9 No ordering. No retransmission. Lost packets are the application\u0026rsquo;s responsibility. No flow control or congestion control either.\nWhy It Exists TCP\u0026rsquo;s reliability comes with latency. The 3-way handshake takes at least 1 RTT. Retransmission adds more delay. Congestion control may throttle transmission speed.\nFor real-time workloads, this latency matters more than reliability. In a voice call, audio arriving 0.5 seconds late via retransmission disrupts the conversation. In a game, a stale position update arriving late is useless. In these cases, dropping lost data beats retransmitting it.\nComparison flowchart LR subgraph TCP direction TB tc1[Connection-oriented] tc2[Ordered delivery] tc3[Retransmission] tc4[Flow/Congestion control] tc5[20+ byte header] end subgraph UDP direction TB uc1[Connectionless] uc2[No ordering] uc3[No retransmission] uc4[No control] uc5[8 byte header] end TCP --- Reliability UDP --- Speed When to Choose Which TCP fits when:\nData integrity is essential: web traffic (HTTP/HTTPS), file transfer (FTP), email (SMTP)\nDatabase communication: query results must arrive correctly\nAPI calls: lost requests or responses are unacceptable\nUDP fits when:\nReal-time streaming: video, voice calls (VoIP)\nOnline gaming: position and state updates\nDNS: small request/response exchanges that need speed\nIoT sensor data: periodic transmission, some loss acceptable\nQUIC: The foundation protocol for HTTP/3. 
It implements TCP-like reliability — retransmission, ordering — and TLS encryption on top of UDP. It reduces the combined latency of TCP\u0026rsquo;s 3-way handshake plus TLS handshake while preserving reliability. Unlike TCP, which is implemented in the OS kernel and difficult to modify, QUIC operates at the application level, enabling faster iteration.\nIt is easy to overlook the transport layer in backend development. But why HTTP runs on TCP, why WebSocket chose TCP, and why DNS uses UDP all start here. In the end, the protocol choice comes down to whether the workload needs reliability more or needs to cut latency more.\n","permalink":"https://wid-blog.github.io/en/posts/tech/network/tcp-udp/","summary":"Two transport protocols that backend developers encounter constantly. A summary of TCP and UDP — connection establishment, reliability guarantees, flow/congestion control mechanisms, and selection criteria.","title":"TCP and UDP"},{"content":"A message came through a friend. A team was building a meme service using deepfake face synthesis and needed a developer. The pitch was interesting — an app that synthesizes a user\u0026rsquo;s face onto GIF memes to create personalized content. In March 2021, I joined the four-person startup team.\nThe Service SwapDo was a deepfake-based face synthesis meme creation service, supporting both Android and iOS.\nThe core feature was face synthesis. Users selected a GIF or image from the app\u0026rsquo;s content library, and their face was composited onto it to create a new meme. Synthesis ran in the background, so users could browse other content while waiting. A push notification arrived upon completion.\nThere was also a virtual plastic surgery feature — users could pick eyes, nose, or mouth from celebrities and composite them onto their own face. A meme world cup let users vote on themed content in tournament brackets. 
The community board allowed sharing created memes and commenting.\nMy Role I served as team lead for four months, managing the team while developing simultaneously. By contribution, roughly 80% backend server development, 30% Android app development, and 10% synthesis technology.\nArchitecture The service architecture was straightforward. Android/iOS clients called REST APIs on an Apache-based backend server. The backend was written in PHP, with data stored in MariaDB. Sentry handled error tracking.\nFace synthesis ran in a separate environment. When the backend received a synthesis request, it invoked a Python script in an Anaconda virtual environment. Upon completion, the script returned the result file path to the backend.\nflowchart LR subgraph Client Android iOS end Client -- \"Error logs\" --\u003e Sentry Client -- \"Request (CRUD)\" --\u003e Backend Backend -- \"Response (JSON)\" --\u003e Client Backend[\"Apache\\nBackend Server\\n(PHP)\"] Backend -- \"Data R/W\" --\u003e MariaDB[(MariaDB)] Backend -- \"Synthesis request\" --\u003e Anaconda[\"Anaconda\\nVirtual Env\\n(Python)\"] Anaconda -- \"Result path\" --\u003e Backend Anaconda -- \"Synthesis result\" --\u003e Storage[(\"File\\nStorage\")] The synthesis pipeline involved multiple steps: recognizing the user\u0026rsquo;s face, extracting facial landmarks, pulling frame data from the GIF content, performing 3D modeling for face synthesis, refining boundaries and skin tone, compositing frame by frame, and encoding the result back into GIF format. 
Libraries included OpenCV, Dlib, and FaceAlignment.\nflowchart LR A[\"Face\\nrecognition\"] --\u003e B[\"Landmark\\nextraction\"] B --\u003e C[\"GIF frame\\nextraction\"] C --\u003e D[\"3D modeling \u0026\\nface synthesis\"] D --\u003e E[\"Boundary \u0026\\nskin tone\\nrefinement\"] E --\u003e F[\"Per-frame\\ncompositing\"] F --\u003e G[\"GIF\\nencoding\"] G --\u003e H[\"Result\\ndelivery\"] Face recognition Landmark extraction 3D modeling Face synthesis Result Technical Contributions Backend Refactoring When I joined, the backend code had all logic in a single file. I introduced the MVC pattern and restructured the code with OOP principles. Separating request handling, business logic, and data access into distinct layers improved code readability and reduced response time by about 10%.\nAndroid Infinite Scroll The content list scroll performance was poor — noticeable stuttering during scrolling. I improved the infinite scroll logic, achieving roughly 60% faster scroll speed. The gains came from using Glide for image loading and refining RecyclerView\u0026rsquo;s recycling logic.\nAsynchronous Synthesis Requests Face synthesis required server processing time. The app could not freeze while users waited for results. I used Android\u0026rsquo;s Service component to handle synthesis requests asynchronously. While synthesis ran in the background, users could explore other content. FCM push notifications informed them when results were ready.\nRetrospective This was my first time joining a startup team. Rather than building to a handed-down spec, I was shaping what the product should be while writing the code. Leading the team was also a first. Balancing development and team management was not easy, but I came to see the product from a broader perspective.\nTechnically, I gained experience in PHP-based REST API design, Android app performance optimization, and background processing patterns. 
Working across backend and mobile in a small team built an intuition for understanding end-to-end service flows.\nIn July 2021, the project wrapped up naturally. Five months that started with a single message from a friend. Looking back, the experience of building a product together was the most valuable lesson.\n","permalink":"https://wid-blog.github.io/en/posts/career/startup/swapdo-startup-retrospective/","summary":"SwapDo, a deepfake-based face synthesis meme service. A record of five months as a developer and team lead in a startup team.","title":"SwapDo Startup Story"},{"content":"When implementing authentication in a web service, you choose between sessions and JWT. Sessions let the server manage state for immediate control. JWT keeps no state on the server, favoring horizontal scaling. The core question is \u0026ldquo;where does the authentication state live.\u0026rdquo;\nWhen a logged-in user requests the next page, the server does not know who they are. HTTP is stateless. To maintain authentication, state must be stored somewhere.\nSession Authentication Session authentication means the server directly manages the user\u0026rsquo;s authentication state.\nWhen a user logs in, the server creates session data and issues a unique session ID. This session ID is delivered to the client via a cookie. On each subsequent request, the cookie carries the session ID, and the server looks it up in the session store to identify the user.\nSession data is stored in server memory, the file system, or an external store like Redis.\nStrengths The server manages sessions directly, making control straightforward. A specific user\u0026rsquo;s session can be invalidated immediately. Features like forced logout or concurrent session limits are easy to implement.\nOnly the session ID reaches the client, so user information faces less exposure risk over the network.\nLimitations Storing state on the server constrains horizontal scaling. 
When multiple server instances run, a user routed to a different instance cannot find their session. Solving this requires sticky sessions or a shared session store like Redis.\nAs user count grows, session store load grows with it.\nJWT JWT (JSON Web Token) embeds authentication information in the token itself. The server stores no state.\nStructure A JWT consists of three parts, separated by dots.\nheader.payload.signature\nThe header specifies the token type and signing algorithm. The payload contains claims such as user identification and expiration time. The signature is the header and payload signed with a secret key.\nWhen the server receives a token, it verifies the signature. If the payload has been tampered with, the signature will not match, revealing forgery. No session store lookup is needed.\nStrengths The server stores no state, so horizontal scaling is unrestricted. Any server instance can verify the token. No separate session store is required.\nLimitations Once issued, a token is difficult to invalidate before expiration. The server holds no token state. Forced logout requires a separate blacklist store, in which case the stateless advantage diminishes.\nThe payload is Base64-encoded, not encrypted. Sensitive information must not be placed in the payload.\nToken size exceeds a session ID. Since the token is transmitted with every request, network overhead is larger than the session approach.\nAccess Token and Refresh Token With JWT, issuing a single token creates a trade-off in expiration settings. A long expiration is risky if the token is stolen. A short expiration forces users to re-authenticate frequently.\nSplitting the token into two solves this.\nThe access token authenticates API requests. Its expiration is set short. Even if stolen, it expires quickly.\nThe refresh token is used to obtain a new access token. Its expiration is relatively long. 
The server can store and manage it, enabling invalidation when needed.\nThe renewal flow:\n1. The client sends an API request with the access token.\n2. When the access token expires, the server returns 401.\n3. The client requests a new access token using the refresh token.\n4. The server validates the refresh token and issues a new access token.\nStorage Strategies Session IDs are stored in cookies by convention. JWT has multiple options, and each carries different security characteristics.\nMemory Stored in a JavaScript variable. Lost on page refresh. Less exposed to XSS than persistent storage, though script running in the same page can still reach it, and it requires re-authentication on every refresh.\nlocalStorage Stored in the browser\u0026rsquo;s storage. Persists through refreshes. However, JavaScript can access it, making it vulnerable to XSS. If XSS occurs, the token can be stolen.\ncookie (HttpOnly) Setting the HttpOnly attribute prevents JavaScript access. This blocks direct token theft via XSS. However, CSRF attacks require separate mitigation. Combining the SameSite attribute with a CSRF token is standard practice.\nStorage | Survives Refresh | XSS Resistance | CSRF Resistance\nMemory | No | Less exposed | N/A\nlocalStorage | Yes | Vulnerable | N/A\ncookie (HttpOnly) | Yes | Not exposed | Requires mitigation\nA common combination stores the access token in memory and the refresh token in an HttpOnly cookie. The access token\u0026rsquo;s short lifespan limits exposure risk. HttpOnly on the refresh token blocks XSS theft.\nComparison\nAspect | Session | JWT\nState storage | Server | Client (token)\nHorizontal scaling | Shared store needed | Unrestricted\nImmediate invalidation | Easy | Difficult (blacklist needed)\nNetwork size | Small (session ID) | Large (full token)\nServer load | Store lookup | Signature verification (CPU)\nService structure and requirements determine the choice. Whether immediate invalidation is essential, and whether the cost of sharing state across servers is acceptable, are the key decision points. When immediate control matters in a single-server environment, sessions fit. 
When authentication must work across distributed servers without shared state, JWT fits.\n","permalink":"https://wid-blog.github.io/en/posts/tech/security/session-and-jwt/","summary":"HTTP is stateless. Maintaining user authentication requires storing state somewhere. This post covers the structure, trade-offs, and storage strategies of server-side sessions and client-side JWT tokens.","title":"Session Authentication and JWT"},{"content":"There was a game I played from time to time. League of Legends (LoL). Watching pro matches, one thing stood out: teams drafting bans and picks, strategizing, scheduling scrims. Regular players had no platform to do any of this.\nIn March 2021, I started building one with a colleague. 55L, short for 5vs5 League. It later became known as GGScrim.\nThe Service 55L had two components.\nggscrim.com was a team matching platform. Teams register and find opponents to schedule scrims. banpick.kr was a virtual ban/pick simulator. Users could simulate the same ban/pick sequence as pro matches.\nThe ban/pick service attracted over 10,000 users.\nTech Choices The API server ran on PHP. I had prior experience building servers with PHP, and starting fast was the priority. I applied MVC with DI, Factory, and Singleton patterns, with Nginx as a reverse proxy.\nChat and notifications required real-time bidirectional communication. I set up a separate server with Node.js + Socket.io — HTTP request/response alone could not meet this requirement.\nMariaDB handled data storage. Redis managed authentication tokens.\nThe client was a TypeScript + Lit SPA. Among Web Components frameworks at the time, Lit was the most concise and ran on standard APIs, which I believed would ensure long-term sustainability. I used Webpack for builds and Firebase for hosting.\nThe desktop app was built with Electron. It needed to communicate with the LoL client via sockets, and browsers cannot directly access local processes. Electron solved this constraint. 
It reused the web codebase and covered both Windows and Mac from a single codebase.\nMobile was handled through a PWA. It allowed installable delivery without a separate native app. I considered it the right way to secure mobile accessibility while reducing development cost.\nArchitecture The overall system split into three areas.\nClients came in three forms: desktop (Electron), web (SPA), and mobile (PWA). Only the desktop app communicated directly with the LoL client via sockets. The rest were browser-based. All three connected to the same API server and Socket.io server.\nOn the server side, I separated the auth server from the API server. Authentication used JWT. Separating auth logic from business logic meant each could be modified and deployed independently. I covered the JWT auth structure and how it compares to session-based auth in a separate post.\nThe API server communicated with the Riot API to fetch champion and summoner data, storing it in MariaDB.\nThe Team It started with two people. I handled all of the service architecture design, the JWT auth server, and the desktop app. I also covered most of the web client.\nAs the service grew, so did the team. Over five months, four people joined. Roles began to split, and the structure shifted from one person making every technical decision to a shared model.\nLooking Back It was an environment where every technical decision fell on me. I had to build my own rationale for each choice and face the trade-offs directly.\nThe Lit choice is one I would change if I did it again. I picked it for the longevity of Web Components standards, but considering ecosystem size and development speed, React would have been a better choice. At the time I prioritized the technology\u0026rsquo;s direction, but in an early-stage startup, speed matters more.\nThe project started from a game I played from time to time. 
I built a platform that did not exist, and the reference points I gained from that process still serve me in production work today.\nReferences Session Authentication and JWT ","permalink":"https://wid-blog.github.io/en/posts/career/startup/55l-ggs-startup-retrospective/","summary":"A startup born from a casual League of Legends habit. From architecture design to desktop apps, a record of two people building a service that grew to 10,000 users over five months.","title":"55L(GGS) Startup Story"},{"content":"It was a time when everyone was stuck at home due to COVID. I thought it would be nice to watch YouTube together with friends even when apart. I found the Android 11 Hackathon hosted by GDG Korea Android — a roughly three-week event. I entered solo.\nThe Service YouTube Together was an app for watching YouTube videos simultaneously with friends, remotely.\nThe YouTube viewing feature worked like the official YouTube app. Users searched and played videos through the YouTube API. A mini player allowed browsing other content while a video played.\nThe core was simultaneous viewing. After adding friends and selecting a video, playback started for everyone at once. Playback position and play/pause state synced in real time. Chat let participants talk while watching.\nArchitecture Finishing in three weeks required a proven structure. The app followed the MVVM pattern — Single Activity with Fragments, dependency injection via Hilt. ViewModel and LiveData managed UI state, and the Repository pattern separated local (SQLite) and remote data sources.\nTwo servers handled the backend: a Spring-based API server for REST calls, and a Java socket server for real-time synchronization.\nflowchart TD Hilt[\"Hilt\\n(DI)\"] -. 
Dependency injection .-\u003e App subgraph App[\"Application\"] Activity[\"Single Activity\\n+ Fragment\"] --\u003e VM[\"ViewModel\\n+ LiveData\"] VM --\u003e Repo[\"Repository\"] end Repo --\u003e Local[\"Local Model\\n(SQLite)\"] Repo --\u003e Remote[\"Remote Data Model\"] Remote --\u003e Socket[\"Socket Server\\n(Java)\"] Remote --\u003e API[\"API Server\\n(Spring)\"] Core Implementation YouTube API Integration I implemented video search and playback using the YouTube Data API. The playback screen used Motion Layout for mini player transitions. Swiping down minimized to a mini player; tapping again returned to full screen.\nReal-time Simultaneous Viewing The key to simultaneous viewing was playback state synchronization. I built a socket server in Java that synced playback position and play/pause state across participants in real time. When one user changed position, everyone else\u0026rsquo;s video jumped to the same point. Chat ran through the same socket connection.\nRetrospective Three weeks might not sound short, but it was tight for building both server and app alone. I had to juggle API server design, socket server implementation, and Android app development all at once. Scoping down and focusing on the core was the key — I cut community features and recommendations, concentrating on simultaneous viewing alone.\nI won the Daedoseogwan Award. Three weeks that started from one idea — watching together even when apart. Looking back, the core lesson of the hackathon was the experience of scoping down under time pressure and carrying it through to completion.\n","permalink":"https://wid-blog.github.io/en/posts/career/hackathon/gdg-korea-android-11-hackathon/","summary":"Entering the GDG Korea Android 11 Hackathon solo, building both server and app in three weeks.","title":"GDG Korea Android 11 Hackathon — YouTube Together"},{"content":"Passing through subway stations, I kept noticing spaces most people overlooked. Galleries, rest corners, performance stages. 
Spaces walked past daily but rarely known. The Korea Railroad Industry Information Center was running a hackathon — \u0026ldquo;Public Data Utilization for Station Convenience Information.\u0026rdquo; Three weeks, public-data based. I joined as team lead with two teammates.\nService Hidden Rest Areas is an Android app that surfaces the lesser-known rest spaces inside subway stations.\nThe app held four features: per-station rest spaces with photo, location, and rating; performances held inside stations with their schedules; per-station open chat rooms; and general facility information — restrooms, nursing rooms, cultural spaces.\nThe last one organized material already in the public dataset into a single view. The first three were our own content. Hidden rest area entries accepted user ratings, which fed the recommendation flow naturally.\nArchitecture Three weeks meant going with a proven stack. The design stayed simple: Android app + single API server + RDB.\nflowchart TD App[\"Android App\\n(Java)\"] --\u003e API[\"API Server\\n(PHP + Apache)\"] API --\u003e DB[(MariaDB)] API -. FCM .-\u003e App App --\u003e Maps[\"Google Maps API\"] App --\u003e PublicData[\"KRIC Public Data\"] The server ran on PHP and Apache. I had worked with this stack before, which mattered when the goal was stable operation under a short deadline. MVC and Singleton patterns shaped the code; MariaDB held station and rest-space data.\nThe app was built in Java. Google Maps API rendered station locations, Glide handled image caching, and Lottie animations covered small interactions. New rest area and performance alerts went out through FCM.\nI worked on both the server and the Android app. The two teammates split data ingestion and additional client screens in parallel.\nKey implementation Public data integration The hackathon\u0026rsquo;s center was transforming public data into something useful for the user. 
We pulled station convenience data from the Korea Railroad Industry Information Center, transformed it, and loaded it into our own database. The source structure did not map directly onto the screens, so an intermediate transformation step made sense. Our own content — performances, chat — lived in separate tables.\nLibrary choices With three weeks, library choice was effectively schedule management. Google Maps gave us the location UI as a drop-in. Glide reduced image-handling code. FCM let the server push alerts directly. Letting libraries cover the easy-to-miss spots freed time for the features that mattered.\nRetrospective The submission was selected. A short window with public-data transformation woven into our own content seemed to be what landed.\nAs team lead, scope and pacing were the heaviest part. The pull to add more features was strong, but reaching a demo-ready state in three weeks meant stopping at some point. Locking the four core features and refusing to extend was probably the single biggest factor in hitting the deadline.\nIt was also my first time working with public data directly. Reshaping external data into our own domain took more time than expected. Next time something similar shows up, I would start with the transformation layer before anything else.\n","permalink":"https://wid-blog.github.io/en/posts/career/hackathon/kric-station-public-data-hackathon/","summary":"Korea Railroad Industry Information Center\u0026rsquo;s public-data hackathon. A three-person team built an Android app that surfaces hidden rest spaces inside subway stations, over three weeks.","title":"KRIC Station Public Data Hackathon — Hidden Rest Areas"},{"content":"Backend engineer. Writing about technology and lessons learned along the way.\n","permalink":"https://wid-blog.github.io/en/about/","summary":"\u003cp\u003eBackend engineer. Writing about technology and lessons learned along the way.\u003c/p\u003e","title":"About"}]