This is a remote position.
What the engineer will actually do:
- P1| Build and schedule Python parsers that extract structured JSON from PowerPoint, PDF, and Excel documents, then land the data in Databricks Bronze → Silver tables.
- P1| Develop/maintain simple Auto Loader or Fivetran pipelines for ERP and ticketing systems.
- P2| Add basic text‑embedding or LLM‑based entity extraction (LangChain or open‑source transformers) to enrich the document feed.
- P3| Write unit tests and lightweight data‑quality checks (Great Expectations) so parsing errors do not break the pipeline.
- P3| Produce concise handover docs for our future data architect.
Skill Set:
Must‑have (core):
- 2‑4 years building ETL or ELT pipelines with Databricks or Snowflake (Delta/Parquet, Spark SQL, Airflow or similar).
- Solid Python (pandas, PySpark) and experience parsing Office files with libraries such as python‑pptx, openpyxl, pdfplumber, or PyPDF.
- Basic SQL tuning and ability to work with structured schemas.
- Git and CI/CD familiarity.
- Exposure to LangChain, Hugging Face Transformers, or any LLM inference workflow.
- Experience adding embeddings to tables for downstream ML or search.
- Great Expectations or similar data‑quality tooling.
- Familiarity with Unity Catalog or Snowflake RBAC concepts.