#archiveteam-ot 2018-08-17,Fri


Time Nickname Message
00:15 🔗 Stiletto has joined #archiveteam-ot
00:20 🔗 Stilett0 has quit IRC (Ping timeout: 600 seconds)
01:06 🔗 ivan I noticed that people (me included) are still using grab-site and wpull 1.2.3 to crawl websites, despite all kinds of issues. it would probably be good to have a replacement at some point. there is already a chrome-based crawler, maybe it could be adapted to do more of the general website crawling that people expect to be able to do?
01:07 🔗 Flashfire https://chrome.google.com/webstore/detail/crawler/pdhedahnmicgeobhejclpbogemgkfpdo
01:07 🔗 Flashfire this one grabs images
01:07 🔗 Flashfire https://chrome.google.com/webstore/detail/site-spider-mark-ii/gedjofgioahckekhpgknhchelbpdogok
01:07 🔗 Flashfire this one works ok ivan
01:08 🔗 ivan the kind of crawling we need is something that can handle millions of pages with an on-disk queue (with stop/resume actually working well this time, I hope)
01:08 🔗 Flashfire Well, I don't know of any Chrome extensions to do that
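
To make ivan's requirement concrete (an on-disk queue where stop/resume actually works), here is a minimal sketch. It is not taken from grab-site, wpull, or any tool named in this log; the `DiskQueue` class, its table layout, and the status names are all hypothetical, and it assumes a single crawler process using Python's standard `sqlite3` module.

```python
import sqlite3


class DiskQueue:
    """Hypothetical URL queue persisted in SQLite so a crawl can stop and resume."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " url TEXT PRIMARY KEY,"
            " status TEXT NOT NULL DEFAULT 'todo')"
        )
        # URLs left 'in_progress' by an interrupted run are retried on resume.
        self.db.execute(
            "UPDATE queue SET status = 'todo' WHERE status = 'in_progress'"
        )
        self.db.commit()

    def add(self, url):
        # INSERT OR IGNORE skips URLs that are already queued or finished.
        self.db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
        self.db.commit()

    def claim(self):
        """Return the oldest pending URL and mark it in progress, or None."""
        row = self.db.execute(
            "SELECT url FROM queue WHERE status = 'todo' ORDER BY rowid LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute(
            "UPDATE queue SET status = 'in_progress' WHERE url = ?", (row[0],)
        )
        self.db.commit()
        return row[0]

    def done(self, url):
        # Finished URLs stay in the table so they are never queued again.
        self.db.execute("UPDATE queue SET status = 'done' WHERE url = ?", (url,))
        self.db.commit()

    def close(self):
        self.db.close()
```

Because every state change is committed to the database file, a crawler that is killed mid-fetch can reopen the same file and continue: claimed-but-unfinished URLs are handed out again, and finished ones are not. A crawl of millions of pages would likely want batched commits and an index on `status`, which this sketch leaves out.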
01:12 🔗 Stiletto has quit IRC (Ping timeout: 246 seconds)
01:24 🔗 * ivan notices some JavaScript WARC writing capabilities in https://github.com/N0taN3rd/node-warc
01:24 🔗 ivan https://github.com/N0taN3rd/Squidwarc
01:32 🔗 BlueMax has joined #archiveteam-ot
01:36 🔗 Stilett0 has joined #archiveteam-ot
02:18 🔗 dxrt- is now known as dxrt
02:18 🔗 dxrt has quit IRC (Quit: ZNC - http://znc.sourceforge.net)
02:18 🔗 dxrt has joined #archiveteam-ot
02:18 🔗 svchfoo1 sets mode: +o dxrt
02:25 🔗 Stilett0 has quit IRC (Ping timeout: 246 seconds)
02:35 🔗 Flashfire Does anyone have a spare spine and a skip full of painkillers?
02:50 🔗 Stilett0 has joined #archiveteam-ot
03:32 🔗 Stilett0 has quit IRC ()
03:41 🔗 Stilett0 has joined #archiveteam-ot
04:38 🔗 MrRadar has quit IRC (Read error: Connection reset by peer)
04:40 🔗 Soni has quit IRC (Ping timeout: 264 seconds)
04:41 🔗 dxrt has quit IRC (Ping timeout: 360 seconds)
04:41 🔗 dxrt has joined #archiveteam-ot
04:41 🔗 Soni has joined #archiveteam-ot
04:42 🔗 Flashfire https://www.japantimes.co.jp/news/2018/08/16/world/boy-11-hacks-replica-u-s-election-website-minutes-convention-probes-electronic-voting-systems/
04:47 🔗 astrid has quit IRC (Read error: Operation timed out)
05:05 🔗 wp494_ has joined #archiveteam-ot
05:09 🔗 chirlu has quit IRC (Excess Flood)
05:09 🔗 wp494 has quit IRC (Ping timeout: 1714 seconds)
05:09 🔗 Soni has quit IRC (Excess Flood)
05:09 🔗 zino__ has quit IRC (Excess Flood)
05:09 🔗 chirlu has joined #archiveteam-ot
05:09 🔗 Soni has joined #archiveteam-ot
05:09 🔗 zino__ has joined #archiveteam-ot
05:09 🔗 MrRadar has joined #archiveteam-ot
05:18 🔗 wp494_ has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
05:19 🔗 wp494 has joined #archiveteam-ot
05:22 🔗 astrid has joined #archiveteam-ot
05:22 🔗 svchfoo1 sets mode: +o astrid
05:28 🔗 Flashfire https://imgur.com/gallery/ekbUT5y
05:44 🔗 dxrt has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 JH88 has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 betamax has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 m007a83 has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 joepie91 has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 Odd0002 has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 Flashfire has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 Aoede has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 kiskaBak has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 hook54321 has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 Frogging has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 MrRadar2 has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 Tenebrae has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 BnAboyZ has quit IRC (efnet.portlane.se se.hub)
05:44 🔗 VoynichCr has quit IRC (efnet.portlane.se se.hub)
05:58 🔗 logchfoo0 starts logging #archiveteam-ot at Fri Aug 17 05:58:03 2018
05:58 🔗 logchfoo0 has joined #archiveteam-ot
06:12 🔗 HCross has joined #archiveteam-ot
06:12 🔗 dxrt_ has joined #archiveteam-ot
06:12 🔗 Kaz has joined #archiveteam-ot
06:12 🔗 svchfoo1 has joined #archiveteam-ot
06:12 🔗 efnet.portlane.se sets mode: +oooo HCross dxrt_ Kaz svchfoo1
06:12 🔗 Polylith has joined #archiveteam-ot
06:12 🔗 Jon has joined #archiveteam-ot
06:25 🔗 MrRadar has quit IRC (Read error: Connection reset by peer)
06:29 🔗 Soni has quit IRC (Ping timeout: 264 seconds)
06:30 🔗 astrid has quit IRC (Excess Flood)
06:30 🔗 Soni has joined #archiveteam-ot
06:31 🔗 schbirid has joined #archiveteam-ot
06:32 🔗 astrid has joined #archiveteam-ot
06:34 🔗 astrid has quit IRC (Excess Flood)
06:37 🔗 MrRadar has joined #archiveteam-ot
06:38 🔗 Stilett0 has quit IRC ()
06:39 🔗 astrid has joined #archiveteam-ot
06:40 🔗 svchfoo1 sets mode: +o astrid
06:40 🔗 Gfy_ has quit IRC (Read error: Operation timed out)
06:41 🔗 chirlu has quit IRC (Excess Flood)
06:41 🔗 chirlu has joined #archiveteam-ot
06:45 🔗 tyzoid has quit IRC (Read error: Operation timed out)
06:49 🔗 Gfy has joined #archiveteam-ot
06:49 🔗 astrid has quit IRC (Write error: Broken pipe)
06:50 🔗 tyzoid has joined #archiveteam-ot
06:52 🔗 Stilett0 has joined #archiveteam-ot
06:54 🔗 astrid has joined #archiveteam-ot
06:54 🔗 t2t2 has quit IRC (Read error: Operation timed out)
06:55 🔗 t2t2 has joined #archiveteam-ot
06:55 🔗 svchfoo1 sets mode: +o astrid
07:06 🔗 jut has quit IRC (Ping timeout: 600 seconds)
07:06 🔗 jut has joined #archiveteam-ot
07:11 🔗 MrRadar has quit IRC (Read error: Operation timed out)
07:19 🔗 MrRadar has joined #archiveteam-ot
07:20 🔗 HCross has quit IRC ()
07:21 🔗 HCross has joined #archiveteam-ot
07:21 🔗 svchfoo3 sets mode: +o HCross
07:22 🔗 astrid has quit IRC (Excess Flood)
07:22 🔗 chirlu has quit IRC (Read error: Operation timed out)
07:25 🔗 zino__ has quit IRC (Read error: Operation timed out)
07:28 🔗 MrRadar has quit IRC (Ping timeout: 360 seconds)
07:32 🔗 chirlu has joined #archiveteam-ot
07:33 🔗 MrRadar has joined #archiveteam-ot
07:34 🔗 zino has joined #archiveteam-ot
07:35 🔗 svchfoo3 sets mode: +o zino
07:42 🔗 schbirid has quit IRC (Remote host closed the connection)
07:42 🔗 MrRadar has quit IRC (Ping timeout: 360 seconds)
07:45 🔗 astrid has joined #archiveteam-ot
07:49 🔗 astrid has quit IRC (Excess Flood)
07:50 🔗 astrid has joined #archiveteam-ot
07:51 🔗 svchfoo1 sets mode: +o astrid
07:51 🔗 MrRadar has joined #archiveteam-ot
08:35 🔗 chferfa has joined #archiveteam-ot
08:38 🔗 Jens has quit IRC (Read error: Connection reset by peer)
08:39 🔗 Jens has joined #archiveteam-ot
08:52 🔗 Jens has quit IRC (Remote host closed the connection)
08:52 🔗 Jens has joined #archiveteam-ot
09:38 🔗 dxrt has quit IRC (Quit: ZNC - http://znc.sourceforge.net)
09:38 🔗 dxrt has joined #archiveteam-ot
09:38 🔗 svchfoo3 sets mode: +o dxrt
09:48 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
10:27 🔗 t2t2 has quit IRC (Quit: No Ping reply in 210 seconds.)
10:28 🔗 t2t2 has joined #archiveteam-ot
11:05 🔗 JAA ivan: Yeah, we need a browser-based solution for sure. Problem is, it's very difficult to do that reliably for the general case, in particular detecting links and link-like things reliably, e.g. when content is loaded through XHR and inserted into the DOM.
11:07 🔗 JAA One thing worth mentioning regarding those JS WARC thingies is that they'll almost certainly fail to preserve the raw data sent by the server since the browser API typically doesn't provide access to that (only to a sanitised version with transfer encoding stripped, headers cleaned, etc.).
11:07 🔗 JAA That's why brozzler uses warcprox for recording, I assume.
11:07 🔗 JAA (I have never tried brozzler though, so no idea how well it works.)
11:08 🔗 JAA I'm still using wpull 1.2.3 for almost all my manual archival as well. It's mostly stable and reliable and works well for anything that doesn't make too heavy use of JS.
11:10 🔗 JAA Though I've grabbed a few things with custom code based on aiohttp and warcio recently as well.
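
JAA's point about raw data versus a sanitised version can be illustrated with a small stdlib-only sketch. The response bytes below are hand-written for the example, and `decode_chunked` is a hypothetical helper, not part of any browser API, node-warc, warcprox, or warcio. It only shows that once the chunked transfer encoding has been stripped, the bytes the server actually sent cannot be recovered from the decoded body alone.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (chunk extensions dropped, trailers ignored)."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.index(b"\r\n", pos)
        # The chunk size is hexadecimal; anything after ';' is an extension.
        size = int(body[pos:eol].split(b";")[0], 16)
        if size == 0:
            break
        start = eol + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return bytes(out)


# Hand-written example of what a server might put on the wire.
RAW_RESPONSE = (
    b"HTTP/1.1 200 OK\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"5\r\nhello\r\n"
    b"6\r\n world\r\n"
    b"0\r\n\r\n"
)

raw_headers, raw_body = RAW_RESPONSE.split(b"\r\n\r\n", 1)
decoded = decode_chunked(raw_body)

print(len(raw_body), "bytes on the wire")
print(len(decoded), "bytes after decoding:", decoded)
```

A recorder that only ever sees `decoded` has lost the chunk boundaries and framing, so it cannot write the response exactly as sent. A recording proxy sits on the connection itself and sees `RAW_RESPONSE`, which fits JAA's guess about why brozzler records through warcprox.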
12:06 🔗 wp494 has quit IRC (Ping timeout: 260 seconds)
12:06 🔗 wp494 has joined #archiveteam-ot
12:35 🔗 chferfa has quit IRC (Ping timeout: 252 seconds)
12:37 🔗 jut has quit IRC (Quit: WeeChat 1.4)
12:40 🔗 chferfa has joined #archiveteam-ot
12:59 🔗 HCross JAA: they even rebooked me for free after I turned up late
12:59 🔗 JAA HCross: Oh nice. Why were you late? SNCF being SNCF?
13:00 🔗 HCross Nope, doing Brussels > London after coming down from the Netherlands this morning. Got my Brussels stations mixed up
13:00 🔗 JAA Ah
13:19 🔗 schbirid has joined #archiveteam-ot
13:49 🔗 JAA https://old.reddit.com/r/AskReddit/comments/980rrc/what_do_you_miss_about_the_early_internet/ Aah, nostalgia.
14:08 🔗 bithippo has joined #archiveteam-ot
14:42 🔗 martini has joined #archiveteam-ot
15:28 🔗 odemg has joined #archiveteam-ot
18:50 🔗 Stilett0 has quit IRC (Read error: Connection reset by peer)
18:51 🔗 Stilett0 has joined #archiveteam-ot
19:01 🔗 caff_ has joined #archiveteam-ot
19:25 🔗 m007a83 has quit IRC (Read error: Connection reset by peer)
19:27 🔗 m007a83 has joined #archiveteam-ot
19:32 🔗 m007a83 has quit IRC (Read error: Connection reset by peer)
19:33 🔗 m007a83 has joined #archiveteam-ot
19:36 🔗 m007a83 has quit IRC (Read error: Connection reset by peer)
19:38 🔗 m007a83 has joined #archiveteam-ot
20:06 🔗 martini has quit IRC (Remote host closed the connection)
21:23 🔗 Belgium has joined #archiveteam-ot
21:24 🔗 Belgium has left
22:36 🔗 schbirid has quit IRC (Remote host closed the connection)
23:23 🔗 m007a83 has quit IRC (Remote host closed the connection)
23:23 🔗 m007a83_ has joined #archiveteam-ot
