A traditional HR tech stack for a team of 12 can cost between $800 and $1,200 a month. Lever for ATS, Lattice for performance, BambooHR for onboarding, and Rippling for payroll are excellent tools. However, they fall short when your team is spread across three continents, works in AI sprints that change every two weeks, and needs to link individual performance to specific model releases. The solution isn’t adding more SaaS: it's building your own system on Notion.
I’m not talking about using Notion simply as a glorified repository for HR policies. I'm referring to a real architecture: relational databases, automation via the API, bidirectional integrations with your code repositories, feedback pipelines linked to model metrics, and an evaluation system that understands that "team contribution" in an AI lab doesn't mean the same as in a traditional company. This is what we've learned by implementing it in three startups over the past 18 months.
Why Traditional HR Doesn’t Work for AI Teams
Conventional talent management tools are designed for organizations with stable roles and predictable metrics. A salesperson has a quota, a product manager has shipped features, and a customer success representative has an NPS score. But how do you evaluate a research scientist who spent six months exploring an architecture that you ultimately discarded? How do you measure the contribution of someone who improved the training pipeline and reduced compute costs by 40% but didn’t "ship" any model?
The problem goes beyond metrics. Development cycles in AI are fundamentally different. A team can pivot from fine-tuning to RAG in just two weeks. Roles also overlap: an ML engineer might be doing data engineering on Monday, tuning hyperparameters on Wednesday, and writing technical documentation on Friday. Quarterly objectives lose their meaning when an arXiv paper can invalidate three months of work in a single afternoon.
AI startups that scale successfully build their own systems, not because it's trendy, but because they need to link talent management with specific technical realities: experiment sprints, model metrics, inference costs, and technical debt in datasets. Notion allows all of this without requiring you to become a full-time developer of internal tools.
The Base Architecture: Five Relational Databases
The core of the system consists of five interconnected databases in Notion, all linked through bidirectional relationships. This is not a casual setup: it's the minimum viable configuration to gain real visibility into your team.
People Database: This is the heart of the system. Each person is a page with structured properties: current role (with history linked to another database), start date, level (ranging from IC1 to Staff/Principal), tech stack (multi-select updated monthly), location, timezone, and direct manager. The critical part here is the calculated fields: automatic tenure, time since last promotion, and time distribution among projects (through rollup from another database).
Projects & Experiments Database: Here is where all technical work resides. Each project is a page linked to involved individuals (with assigned time percentage), GitHub repositories (via API), target metrics (accuracy, latency, and cost per inference), current status, kickoff date, and deadline. AI sprints are fixed-duration projects lasting 2-3 weeks. While this database integrates with Linear or Jira via Zapier, the source of truth resides in Notion.
Reviews & Feedback Database: Formal evaluations occur every six months, but ad-hoc feedback linked to specific projects is also recorded here. Each review has relations to the person evaluated, the evaluator, the projects completed during the period, the technical competencies evaluated (rated from 1 to 5), and the areas for development identified. The value here is being able to filter: "all reviews of ML engineers in the past 12 months where 'technical communication' scored below 3."
Skills Matrix Database: This is your team's technical inventory. Each skill is a page (PyTorch, Kubernetes, Transformer architectures, RAG systems) with a proficiency level for each person. This is not self-reported: it gets updated during formal reviews and validated with actual contributions to projects. It allows you to answer questions like: "Who can review a CUDA optimization PR?" or "Do we have internal coverage for fine-tuning Llama?"
Career Progression Database: This sets the level matrix and expectations. It defines what it means to be IC2, IC3, Senior, or Staff in your specific startup. It includes technical responsibilities, scope of impact, expected autonomy, and mentorship skills. Each person in the People Database has a relationship to their current level here, and you can compare current profiles against the requirements for the next level.
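To make the structure above more concrete, here is a minimal sketch of what a create-database request for the People database could look like. It assumes the 2022-06-28 version of the Notion API; the property names and the placeholder IDs are hypothetical and only illustrate the shape, not our exact workspace.

```python
def people_database_payload(parent_page_id: str, levels_db_id: str) -> dict:
    """Build a body for POST https://api.notion.com/v1/databases.

    Property names are illustrative assumptions, not a fixed schema.
    """
    return {
        "parent": {"type": "page_id", "page_id": parent_page_id},
        "title": [{"type": "text", "text": {"content": "People"}}],
        "properties": {
            "Name": {"title": {}},
            "Start date": {"date": {}},
            "Tech stack": {
                "multi_select": {
                    "options": [{"name": "PyTorch"}, {"name": "Kubernetes"}]
                }
            },
            "Timezone": {"rich_text": {}},
            "Manager": {"people": {}},
            # Two-way relation to the Career Progression database.
            "Level": {
                "relation": {
                    "database_id": levels_db_id,
                    "type": "dual_property",
                    "dual_property": {},
                }
            },
        },
    }


# Placeholder IDs; replace them with the IDs from your own workspace.
payload = people_database_payload("parent-page-id", "career-progression-db-id")
```

Calculated fields such as tenure are easier to add as formula and rollup properties in the Notion UI once the relations exist.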
Automations That Turn Data into Decisions
Databases are fundamental infrastructure. However, the real value emerges when you automate flows that would be impossible in traditional tools.
Automatic Sync with GitHub: Through the Notion API and GitHub webhooks, each merged PR automatically updates the author's contribution record on their People page. We use GitHub Actions to extract data like lines of code (irrelevant as an absolute metric but useful for detecting anomalies), the number of PRs reviewed, average review time, and files modified. This feeds performance conversations with objective data: "In the last three months, you've primarily been on infrastructure, but your career plan points towards architecture. Do you need more core model projects?"
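As a rough sketch of that sync, the function below turns a GitHub `pull_request` webhook event into the body of a Notion page update. The Notion property names and the `current_totals` dictionary are assumptions for illustration; mapping a GitHub login to a People page ID and sending the request are left out.

```python
from typing import Optional


def contribution_update(event: dict, current_totals: dict) -> Optional[dict]:
    """Return a body for PATCH https://api.notion.com/v1/pages/{page_id},
    or None when the event is not a merged pull request."""
    pr = event.get("pull_request", {})
    if event.get("action") != "closed" or not pr.get("merged"):
        return None  # opened, edited, or closed without merging

    lines = pr.get("additions", 0) + pr.get("deletions", 0)
    return {
        "properties": {
            # Hypothetical number properties on the People page.
            "PRs merged": {"number": current_totals.get("prs_merged", 0) + 1},
            "Lines changed": {
                "number": current_totals.get("lines_changed", 0) + lines
            },
            "Files modified": {
                "number": current_totals.get("files_modified", 0)
                + pr.get("changed_files", 0)
            },
        }
    }
```

Keeping the payload construction separate from the HTTP call makes it easy to test the logic against recorded webhook events.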
Talent Distribution Dashboard: A rollup in the Projects database shows in real-time what percentage of your team is in research, production, or infrastructure, how many people are overallocated (more than 100% assigned time), and which projects depend on a single person (bus factor = 1). This is critical for AI startups where it's easy for everyone to want to work on "the new model" and no one on "maintaining the data pipeline."
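In Notion these numbers come from rollups, but the logic is simple enough to sketch in Python, which is also handy for checking the rollups against an export. The `(person, project, percent)` tuple format is an assumption for this example.

```python
from collections import defaultdict


def overallocated(assignments):
    """Return {person: total_percent} for people assigned more than 100%."""
    totals = defaultdict(int)
    for person, _project, percent in assignments:
        totals[person] += percent
    return {person: total for person, total in totals.items() if total > 100}


def single_owner_projects(assignments):
    """Return projects that depend on exactly one person (bus factor = 1)."""
    owners = defaultdict(set)
    for person, project, _percent in assignments:
        owners[project].add(person)
    return sorted(p for p, people in owners.items() if len(people) == 1)


assignments = [
    ("ana", "rag", 60),
    ("ana", "data-pipeline", 50),
    ("li", "rag", 40),
    ("li", "eval", 60),
]
print(overallocated(assignments))          # {'ana': 110}
print(single_owner_projects(assignments))  # ['data-pipeline', 'eval']
```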
Turnover Risk Alerts: Through formulas in Notion, we calculate a basic "engagement score" that considers days since the last 1-on-1 with the manager, time since the last documented positive feedback, whether the person has expressed interest in a role change (a boolean field that managers update), and if they have been without a promotion or significant change in responsibilities for over 18 months. When this score crosses a threshold, a task is automatically created in the manager's database.
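The score itself lives in a Notion formula, but an equivalent in Python shows the idea. Treat this as a sketch: the weights, the 14-day and 60-day cutoffs, and the alert threshold are assumptions chosen for illustration; only the 18-month rule comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    days_since_last_one_on_one: int
    days_since_positive_feedback: int
    wants_role_change: bool           # boolean field updated by the manager
    months_since_promotion_or_change: int


ALERT_THRESHOLD = 3  # assumed value; tune it for your team


def risk_points(s: Signals) -> int:
    """Add up risk points; a higher total means a higher turnover risk."""
    points = 0
    if s.days_since_last_one_on_one > 14:        # assumed cutoff
        points += 1
    if s.days_since_positive_feedback > 60:      # assumed cutoff
        points += 1
    if s.wants_role_change:
        points += 2
    if s.months_since_promotion_or_change > 18:  # from the rule above
        points += 2
    return points


def needs_manager_task(s: Signals) -> bool:
    return risk_points(s) >= ALERT_THRESHOLD
```

A simple additive score like this is easy to explain to managers, which matters more here than statistical precision.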
Calibration Reports for Reviews: Before each evaluation cycle, we generate filtered views in Notion that show the distribution of ratings from the previous period. This prevents each manager from calibrating in isolation. You can see: "In the last cycle, 60% of the ML team was rated 'exceeds expectations.' Is that realistic? Or are we inflating ratings?" In my experience, AI startups tend to overvalue pure technical contributions and undervalue work that doesn't ship models but keeps operations running.
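The calibration view boils down to a percentage breakdown of ratings per team. A minimal sketch, assuming ratings arrive as a flat list of labels exported from the Reviews database:

```python
from collections import Counter


def rating_distribution(ratings):
    """Return {rating_label: percent_of_reviews}, rounded to one decimal."""
    if not ratings:
        return {}
    counts = Counter(ratings)
    total = len(ratings)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


last_cycle = ["exceeds expectations"] * 6 + ["meets expectations"] * 4
print(rating_distribution(last_cycle))
# {'exceeds expectations': 60.0, 'meets expectations': 40.0}
```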
Critical Integrations That Close the Loop
Using Notion as an isolated system is like having an expensive wiki. The true power emerges from connecting it with your real work tools.
Slack + Notion API for Structured 1-on-1s: We built a Slack bot (200 lines of Python, hosted on Cloud Run for $3 a month) that sends each manager-report pair a private thread every week with a template for the 1-on-1. Notes are archived directly on the person's page in the People Database. This solves the problem of managers not documenting conversations, without imposing a new tool on them. The data in Notion then supports searches like: "When was the last time we discussed career progression with Ana?"
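The archiving step can be sketched as follows: the notes collected from the Slack thread become blocks appended to the person's page. This is an illustration rather than our bot's actual code; the heading format is an assumption, and the Slack side is omitted.

```python
from datetime import date
from typing import List


def _text(content: str) -> list:
    """Wrap a string in Notion's rich_text structure."""
    return [{"type": "text", "text": {"content": content}}]


def one_on_one_blocks(meeting_date: date, notes: List[str]) -> dict:
    """Build a body for PATCH https://api.notion.com/v1/blocks/{page_id}/children."""
    heading = {
        "object": "block",
        "type": "heading_3",
        "heading_3": {"rich_text": _text(f"1-on-1 {meeting_date.isoformat()}")},
    }
    bullets = [
        {
            "object": "block",
            "type": "bulleted_list_item",
            "bulleted_list_item": {"rich_text": _text(note)},
        }
        for note in notes
        if note.strip()  # skip empty lines from the thread
    ]
    return {"children": [heading] + bullets}
```

A dated heading per meeting keeps the page searchable without needing a separate database for notes.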
Linear/Jira → Notion to Link Delivery with Performance: Through Zapier (admittedly, using no-code for this is an anti-pattern), each closed issue in Linear updates a field in the Projects Database. This lets you evaluate not only "shipped features" but also context: How many issues were bug fixes versus features? How much time was spent on tech debt? In AI teams, closing 50 issues for "adjusting classification threshold" is qualitatively different from closing 5 issues for "implementing new retrieval architecture."
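The bug-versus-feature breakdown can be computed from issue labels. The sketch below assumes each closed issue is a dictionary with a `labels` list, and that your tracker uses labels named `bug` and `tech-debt`; both are assumptions to adapt to your own labeling.

```python
from collections import Counter


def issue_mix(closed_issues):
    """Count closed issues by category: bug, tech_debt, or feature."""
    counts = Counter()
    for issue in closed_issues:
        labels = {label.lower() for label in issue.get("labels", [])}
        if "bug" in labels:
            counts["bug"] += 1
        elif "tech-debt" in labels:
            counts["tech_debt"] += 1
        else:
            counts["feature"] += 1  # default bucket when no label matches
    return dict(counts)
```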
Google Calendar API for Time Balance: We integrated team calendars (with their explicit consent) to extract metadata: hours in meetings versus deep work, and the split among 1-on-1s, all-hands, and external meetings. This feeds productivity conversations: "You're in meetings over 25 hours a week. Is that sustainable for an IC role? Should we rethink your responsibilities?" For research roles, a high share of time spent in meetings often correlates with low technical output.
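Once the events are exported, the time balance is a small aggregation. This sketch assumes each event has already been reduced to a `(start, end, kind)` tuple with ISO 8601 timestamps; fetching events from the Google Calendar API and classifying them is not shown.

```python
from datetime import datetime

MEETING_HOURS_LIMIT = 25  # weekly threshold mentioned above


def meeting_hours(events):
    """Return {kind: hours} for a list of (start_iso, end_iso, kind) tuples."""
    totals = {}
    for start, end, kind in events:
        duration = datetime.fromisoformat(end) - datetime.fromisoformat(start)
        totals[kind] = totals.get(kind, 0.0) + duration.total_seconds() / 3600
    return totals


def over_limit(events) -> bool:
    """True when total meeting time for the period exceeds the limit."""
    return sum(meeting_hours(events).values()) > MEETING_HOURS_LIMIT


week = [
    ("2024-05-06T10:00:00", "2024-05-06T11:00:00", "1-on-1"),
    ("2024-05-06T13:00:00", "2024-05-06T14:30:00", "all-hands"),
]
print(meeting_hours(week))  # {'1-on-1': 1.0, 'all-hands': 1.5}
```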
Notion → Data Warehouse for Advanced Analytics: If your startup already has a data warehouse (like BigQuery or Snowflake), use the Notion API to export weekly snapshots of your databases. This allows for analyses that Notion doesn't natively support, like cohort analysis ("How long does it typically take someone to go from IC2 to IC3?"), retention modeling ("What factors predict turnover?"), and sentiment analysis on 1-on-1 notes (with LLMs, of course). A 30-person startup doesn’t need this; a 100+ person one does.
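The snapshot export mostly comes down to paginating through a database query. The sketch below keeps the HTTP call behind a `query_page` callable (a hypothetical helper you would write yourself) so the pagination logic can be tested on its own; it relies on the `results`, `has_more`, and `next_cursor` fields of the Notion query response.

```python
def export_database(query_page):
    """Collect every row of a Notion database.

    `query_page(start_cursor)` must return a dict shaped like the response of
    POST https://api.notion.com/v1/databases/{database_id}/query, sending
    `start_cursor` in the request body only when it is not None.
    """
    rows = []
    cursor = None
    while True:
        response = query_page(cursor)
        rows.extend(response["results"])
        if not response.get("has_more"):
            return rows
        cursor = response["next_cursor"]


# Usage with a fake two-page response, standing in for real API calls.
fake_pages = {
    None: {"results": ["row-1", "row-2"], "has_more": True, "next_cursor": "c1"},
    "c1": {"results": ["row-3"], "has_more": False, "next_cursor": None},
}
print(export_database(lambda cursor: fake_pages[cursor]))
# ['row-1', 'row-2', 'row-3']
```

Loading the collected rows into BigQuery or Snowflake is a separate step that depends on your warehouse client.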
The Evaluation System That Reflects Real AI Work
Traditional performance reviews often ask: "Did you meet your quarterly objectives?" In AI, the more relevant question is: "Did you make the right technical decisions given the information available at the time, even if the final outcome wasn’t as expected?"
Our framework in Notion evaluates five dimensions, each supported by specific evidence from the Projects Database:
Technical Execution: It’s not just about "lines of code" or "models shipped." The evaluation considers: Did you identify the right approach to the problem? Is your code maintainable? Did you document design decisions? Did you detect and communicate blockers early? Evidence comes from specific PRs, written design docs, and post-mortems of failed experiments.
Impact Scope: Did your work affect only your project (IC2), multiple projects (IC3), the technical direction of the team (Senior), or the product strategy (Staff)? In AI startups, someone who optimizes the training pipeline and reduces costs by 40% has a greater impact than someone who adds three features to a model. The system in Notion links each project to predefined "areas of impact": cost, latency, accuracy, developer experience, and user experience.
Collaboration & Knowledge Sharing: How many PRs did you review? Did you write technical documentation that others use? Did you mentor a junior? This is partially extracted from GitHub (number of PRs reviewed) and partially from qualitative feedback in the Reviews Database. This is critical in AI teams, where knowledge concentrated in a few people can slow down the entire team.
Judgment & Decision-Making: This dimension is the most complicated to measure but most crucial in AI. Did you propose using RAG when fine-tuning would have been excessive? Did you identify that the issue was data quality and not model architecture? This is evaluated retrospectively: we document in the Projects Database the initial decision, the actual outcome, and whether the reasoning was sound. Someone who proposed an approach that didn't work but had solid reasoning scores better than someone who got it right by luck.
Autonomy & Ownership: Do you need constant direction (IC1), work independently within a defined scope (IC3), define your own scope (Senior), or identify gaps in the team's strategy without anyone asking you (Staff)? This is inferred from patterns in 1-on-1 notes: if your manager is redirecting your work every week, you’re not operating at the level your title suggests.
Each dimension is rated on a scale of 1 to 5, but the numbers are secondary. The most valuable aspect is the structured discussion that Notion facilitates: the manager pre-fills evidence from the Projects Database, the report adds their perspective, and the review conversation focuses on the gaps between expectations and reality.
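Before a review conversation, it helps to check that every dimension has both a score and linked evidence. A minimal sketch of that check, assuming a review is exported as two dictionaries keyed by dimension (the key names are hypothetical):

```python
DIMENSIONS = (
    "technical_execution",
    "impact_scope",
    "collaboration",
    "judgment",
    "autonomy",
)


def review_problems(scores: dict, evidence: dict) -> list:
    """Return a list of human-readable problems; empty means ready to discuss."""
    problems = []
    for dimension in DIMENSIONS:
        score = scores.get(dimension)
        if not isinstance(score, int) or not 1 <= score <= 5:
            problems.append(f"{dimension}: score must be an integer from 1 to 5")
        if not evidence.get(dimension):
            problems.append(f"{dimension}: no evidence linked")
    return problems
```

Requiring evidence for every dimension keeps the discussion anchored in specific projects rather than in the number.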
What We Learned Implementing This in Production
First Lesson: The initial setup requires 20 to 30 hours of work from someone technical who understands both Notion and your startup's operations. It’s not plug-and-play. Budget this time, or the implementation will fail. We found that involving a senior IC in the design (not just the HR lead) is critical: they understand which data really matters and which is just theater.
Second Lesson: Complex automations tend to break frequently. The Notion API is solid enough for production, but it’s not as reliable as Stripe, so plan for monthly maintenance. One lesson we learned the hard way: always validate that relation properties exist before creating new pages, or your script will fail silently.
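That validation can be a few lines run before any write. The sketch below assumes you have already fetched the database schema (the response of GET https://api.notion.com/v1/databases/{database_id}) and checks that each property your script expects exists and is a relation; the property names in the usage example are hypothetical.

```python
def missing_relations(database_schema: dict, required: list) -> list:
    """Return the required property names that are absent or not relations."""
    properties = database_schema.get("properties", {})
    return [
        name
        for name in required
        if properties.get(name, {}).get("type") != "relation"
    ]


schema = {
    "properties": {
        "Name": {"type": "title"},
        "Person": {"type": "relation"},
    }
}
problems = missing_relations(schema, ["Person", "Project"])
if problems:
    # Fail loudly instead of creating half-linked pages.
    print(f"Aborting: missing relation properties {problems}")
```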