[00:01] *** Stilett0- has quit IRC (Read error: Operation timed out)
[00:25] *** GE has quit IRC (Quit: zzz)
[00:53] *** kristian_ has joined #archiveteam-bs
[01:09] *** amiiboh has joined #archiveteam-bs
[01:49] yipdw: no, i think the offset and length you're seeing are only related to the s3 stuff, they're not exposed in the cli. It could probably be made to do so without too much trouble
[02:01] *** ndiddy has joined #archiveteam-bs
[02:01] *** ndiddy has left
[02:10] *** Roelandus has quit IRC (Ping timeout: 268 seconds)
[02:21] *** j08nY has quit IRC (Quit: Leaving)
[02:37] *** pizzaiol1 has quit IRC (Remote host closed the connection)
[03:36] *** VADemon has quit IRC (Quit: left4dead)
[03:58] *** Aranje has quit IRC (Quit: Three sheets to the wind)
[04:00] *** kyounko has joined #archiveteam-bs
[04:37] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:37] *** BlueMaxim has joined #archiveteam-bs
[04:53] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:53] *** BlueMaxim has joined #archiveteam-bs
[05:46] *** zhongfu_ has joined #archiveteam-bs
[05:46] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[05:55] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[06:02] *** Sk1d has joined #archiveteam-bs
[06:22] *** Stiletto has quit IRC (Ping timeout: 244 seconds)
[06:30] *** kyounko has quit IRC (KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
[06:35] *** midas1 has joined #archiveteam-bs
[06:50] *** Stilett0- has joined #archiveteam-bs
[07:21] *** kristian_ has quit IRC (Quit: Leaving)
[07:31] *** schbirid has joined #archiveteam-bs
[07:47] *** Honno has joined #archiveteam-bs
[07:58] *** odemg has quit IRC (Remote host closed the connection)
[08:36] *** GE has joined #archiveteam-bs
[09:21] *** j08nY has joined #archiveteam-bs
[09:49] *** mls has joined #archiveteam-bs
[09:55] *** mls has quit IRC (Quit: leaving)
[10:22] *** GE has quit IRC (Remote host closed the connection)
[10:56] *** Honno has quit IRC (Ping timeout: 370 seconds)
[11:08] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:38] *** JAA has joined #archiveteam-bs
[11:43] Hi guys. As you probably know already, DMOZ is shutting down tomorrow. MasterX24 is running a grab (but doesn't seem to be here right now; his last status update from Thursday was "1M urls done, 780k left"), and I found two items on the Internet Archive which look like ArchiveBot grabs.
[11:43] I had a closer look at the second of those latter grabs (https://archive.org/details/falconk_archivebot_www_dmoz_org_20170302), and I discovered that the archive isn't really complete: 12.8% of the archived pages (23268 of 181366) are useless status 420 "Please see our terms of use" error pages.
[11:44] I'm aware that the most crucial data resides in the RDF files, which appear to be safely backed up (at https://archive.org/details/falconk_archivebot_rdf_dmoz_org_20170228), but I believe that doesn't cover everything and also isn't as user-friendly (clicking through pages on the Wayback Machine vs. having to find some way to parse RDF files etc.).
[11:44] So, what can be done about this?
[11:45] We can take another shot
[11:46] Would that be fast enough though? As said, they'll shut down tomorrow, and it looks like the previous grabs took several days.
[11:48] I guess it should be possible to extract the 420 pages from the CDX and just archive those again, but I'm not sure how to do that (in particular, the latter part; haven't worked with wget/WARC directly yet)
[12:01] *** j08nY has quit IRC (Read error: Operation timed out)
[12:01] *** GE has joined #archiveteam-bs
[12:10] *** odemg has joined #archiveteam-bs
[12:50] *** pizzaiolo has joined #archiveteam-bs
[14:32] *** j08nY has joined #archiveteam-bs
[14:44] https://www.reddit.com/r/DataHoarder/comments/5z2499/many_of_the_uc_berkeley_youtube_videos_still_need
[14:45] *** odemg has quit IRC (Remote host closed the connection)
[14:51] *** Roelandus has joined #archiveteam-bs
[14:52] Has anyone found a solution to opening large warc files?
[14:56] Have you tried pywb? I don't know how large your files are and how well it handles those, but that's what I've been using so far.
[15:05] So, regarding DMOZ: I looked at the 420 pages in more detail, and those are almost exclusively pages banned by robots.txt. I guess DMOZ has some server-side detection of crawlers and blocks the access.
[15:05] There are a handful of other pages in there, plus all the editors' profile pages (which should actually be crawlable according to the robots.txt).
[15:09] I'm trying to open the hyves database but the file is too large
[15:10] *** Roelandus has quit IRC (Quit: Page closed)
[15:11] *** Jonison has joined #archiveteam-bs
[15:11] After filtering out the /public/{abuse,apply,flag,sendemail,suggest} pages, that's just 296 pages. So it's not as bad as I thought before.
[15:11] *** Roelandus has joined #archiveteam-bs
[15:13] Roelandus: how large is that file?
[15:14] like 50GB compressed
[15:14] 100GB uncompressed
[15:15] Hmm, yeah, that's quite a bit larger than what I've worked with.
[15:16] Still, I'd try out pywb. If it doesn't like the file, maybe split it with something like warcat.
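Editor's sketch for the idea raised at 11:48 and the filtering described at 15:11: one way to pull the status-420 URLs out of a CDX index and drop the /public/{abuse,apply,flag,sendemail,suggest} pages. This is only an illustration, not the method actually used in the channel. It assumes a space-separated CDX file whose first line is a header such as " CDX N b a m s k r M S V g", where 'a' marks the original URL column and 's' the status column. The function names are hypothetical.

```python
from urllib.parse import urlsplit

# Path prefixes that were filtered out in the discussion above.
EXCLUDED_PREFIXES = tuple(
    "/public/" + name
    for name in ("abuse", "apply", "flag", "sendemail", "suggest")
)


def urls_with_status(cdx_lines, status="420"):
    """Yield the original URL of every CDX record with the given status.

    Assumption: the first non-empty line is a CDX header whose letters
    name the columns ('a' = original URL, 's' = HTTP status code).
    """
    url_col = status_col = None
    for line in cdx_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CDX":
            # Header line: the letters after "CDX" describe the columns.
            legend = fields[1:]
            url_col, status_col = legend.index("a"), legend.index("s")
            continue
        if url_col is None:
            raise ValueError("no CDX header line before the first record")
        # Skip malformed records that are too short to hold both columns.
        if len(fields) > max(url_col, status_col) and fields[status_col] == status:
            yield fields[url_col]


def is_excluded(url):
    """True if the URL path starts with one of the filtered prefixes."""
    return urlsplit(url).path.startswith(EXCLUDED_PREFIXES)


def retry_list(cdx_lines, status="420"):
    """Return the de-duplicated, filtered URLs worth fetching again."""
    seen = set()
    result = []
    for url in urls_with_status(cdx_lines, status):
        if url not in seen and not is_excluded(url):
            seen.add(url)
            result.append(url)
    return result
```

The returned list could be written to a file, one URL per line, and handed to wget through --input-file. Note that the prefix match is deliberately loose, so a path such as /public/flagged would also be excluded.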
[15:17] I tried splitting but the problem was that when I opened a split file it didn't show me a webpage
[15:17] It showed me just nothing
[15:21] *** odemg has joined #archiveteam-bs
[15:22] *** odemg has quit IRC (Remote host closed the connection)
[15:26] *** j08nY has quit IRC (Ping timeout: 633 seconds)
[15:31] *** j08nY has joined #archiveteam-bs
[15:46] *** odemg has joined #archiveteam-bs
[15:50] *** tephra_ has quit IRC (Ping timeout: 260 seconds)
[15:50] *** tephra has joined #archiveteam-bs
[16:24] *** odemg has quit IRC (Remote host closed the connection)
[16:40] *** odemg has joined #archiveteam-bs
[16:44] *** Honno has joined #archiveteam-bs
[16:45] *** tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
[16:58] *** Roelandus has quit IRC (Ping timeout: 268 seconds)
[17:11] *** Roelandus has joined #archiveteam-bs
[17:11] I don't understand pywb
[17:16] *** odemg has quit IRC (Remote host closed the connection)
[17:27] https://pypi.python.org/pypi/pywb describes it pretty well I think. I have to mention that the "Using Existing Web Archive Collections" part didn't work for me last time I tried. Adding the WARCs directly is fine though, but it'll take quite some time (since it re-indexes the WARC).
[17:32] *** pizzaiolo has quit IRC (Ping timeout: 245 seconds)
[17:38] it sucks there aren't any yt tutorials available on warcs
[17:40] what cmds do I have to put in to ms powershell?
[17:41] Oh, Windows. No idea, haven't used it in many years.
[17:54] this looks great https://github.com/webrecorder/warcio
[19:09] https://sandstorm.io/news/2017-03-13-joining-cloudflare :|
[19:18] *** Stilett0- has quit IRC ()
[19:23] *** Jonison has quit IRC (Quit: Leaving)
[19:24] *** odemg has joined #archiveteam-bs
[19:29] *** ndiddy has joined #archiveteam-bs
[19:40] Are you guys all on Linux? If so, on what OS?
[19:42] *** JAA_ has joined #archiveteam-bs
[19:42] I'm using Debian
[19:44] Same (though my primary desktop is still Windows 7. Planning to switch over to desktop Linux once 7 goes out of support.)
[19:45] *** JAA has quit IRC (Ping timeout: 268 seconds)
[19:45] *** JAA_ is now known as JAA
[19:48] start phasing it in now so the switch isn't jarring :)
[19:48] use programs with linux versions, etc
[19:49] Yep. My main browser is Firefox, main text editor is Vim, using Open Office on the rare occasions I need to do office-type work
[19:49] All my programming projects are done on my Debian server
[19:50] s/OpenOffice/LibreOffice/g
[19:50] Right.
[19:50] *** GE has quit IRC (Remote host closed the connection)
[19:54] *** Honno has quit IRC (Ping timeout: 370 seconds)
[19:55] Well, Microsoft's background services are bs. But if you want to play games like Overwatch you pretty much have to do that on Windows.
[20:05] *** odemg has quit IRC (Remote host closed the connection)
[20:18] I'm glad all the games I play run on Linux, since they're mostly indie management sims and such.
[20:18] RimWorld <3
[20:25] *** Stilett0- has joined #archiveteam-bs
[20:26] *** Stilett0- is now known as Stilett0
[21:12] *** Stilett0 has quit IRC ()
[21:16] *** GE has joined #archiveteam-bs
[21:35] *** BlueMaxim has joined #archiveteam-bs
[22:12] *** Honno has joined #archiveteam-bs
[22:14] First time using WARCing wget. Does this look like a decent command? Any recommendations? wget --user-agent ArchiveTeam --output-file ./wget.log --output-document ./wget.tmp -e robots=off --page-requisites --timeout 30 --tries inf --waitretry 30 --warc-file ./grab --input-file ./links
[22:15] *** REiN^ has quit IRC (Read error: Operation timed out)
[22:21] I noticed that this command repeatedly downloads resources (images, stylesheets, etc.) shared between the individual links. That inflates the WARC and isn't what a browser would usually be doing anyway. Is it possible to avoid that?
[22:21] JAA: consider using wpull instead right away
[22:22] *** GE has quit IRC (Remote host closed the connection)
[22:22] you could use -nc maybe
[22:30] wpull is nice, grab-site is best if you are managing multiple grabs.
[22:30] (note that grab-site manages wpull instances)
[22:33] *** schbirid has quit IRC (Read error: Operation timed out)
[22:35] *** ZexaronS has joined #archiveteam-bs
[22:36] *** REiN^ has joined #archiveteam-bs
[22:37] Thanks, looks interesting. Right now, I'm just trying to get the job done so I can go to bed, but I'll definitely have to look at those projects again.
[22:38] (I believe that I've had issues with installing Tornado at some point, and I don't really want to go down that rabbit hole right now.)
[22:38] Just a heads-up people http://news.berkeley.edu/2017/03/01/course-capture/
[22:39] ZexaronS: yup: http://archiveteam.org/index.php?title=UC_Berkeley_Course_Captures :-)
[22:45] By the way, regarding --no-clobber/-nc: "WARC output does not work with --no-clobber, --no-clobber will be disabled."
[22:46] *** bwn has quit IRC (Read error: Operation timed out)
[22:48] Sorry, I'm late, the news site I looked at is like 2 weeks late too
[22:54] *** bwn has joined #archiveteam-bs
[22:57] Uh oh, wget eats all my memory until it's killed by the kernel...
[23:00] I just noticed that Tornado is also a dependency of seesaw and thus already installed on my machine. Never mind that earlier comment. I'll try wpull now.
[23:01] *** Stilett0 has joined #archiveteam-bs
[23:01] *** Stilett0 is now known as Stiletto
[23:06] *** bwn has quit IRC (Ping timeout: 244 seconds)
[23:07] Yeah, that's not working too well either. Hitting https://github.com/chfoo/wpull/issues/349 among others
[23:14] *** bwn has joined #archiveteam-bs
[23:16] Looks like the wget-lua build with --truncate-output works. Memory usage is still increasing with time, but not as drastically.
[23:26] ... but that doesn't download the same files as without that option.
[23:37] Alright, I give up. Here's the list of the pages in ArchiveBot's DMOZ grab which are an error 420 page (cf. my messages around 11:45 UTC) if anyone with more experience than me wants to take a shot: https://ghostbin.com/paste/2ejhg
[23:41] *** BlueMaxim has quit IRC (Read error: Operation timed out)