Distributed data collection and integration platform for multi-source web data
河北华网 operates a data infrastructure stack built on Java, Hadoop, Spark, and Flink, with emerging adoption of ClickHouse, Hudi, and SeaTunnel, designed for large-scale web data collection and heterogeneous source integration. Its active projects (distributed crawlers, high-availability collection platforms, data governance, API layers) and its stated pain points, multi-source integration and data quality, suggest a platform positioned between raw data acquisition and downstream analytics consumption.
河北华网 is a software development company headquartered in Shijiazhuang, Hebei Province, China. The company builds distributed systems for web data collection, integration, and governance, for use cases that require ingesting, unifying, and validating data from heterogeneous sources. The technical foundation combines open-source big-data tooling (Hadoop, Spark, Flink, Kafka, HBase) with web-scraping frameworks (Scrapy, Pyspider, Nutch) and modern data pipeline tools (SeaTunnel, DolphinScheduler). Current operations center on crawler infrastructure, collection platform reliability, data governance implementation, and API development.
The technology stack includes Java, Python, Hadoop, Spark, Flink, Kafka, HBase, ClickHouse, Hudi, SeaTunnel, and DolphinScheduler, plus web-scraping tools (Scrapy, Pyspider, Nutch). Data storage spans SQLite, MySQL, Oracle, and HDFS.
Active projects include a distributed crawler system, a high-availability data collection platform, data governance implementation, heterogeneous data integration, data application system development, and a network data capture platform.
河北华网 is hiring, with 6 active roles across data and engineering teams, primarily at mid-level and manager seniority. Hiring is currently limited to China.