#archiveteam-bs 2017-03-13,Mon


Time Nickname Message
00:01 🔗 Stilett0- has quit IRC (Read error: Operation timed out)
00:25 🔗 GE has quit IRC (Quit: zzz)
00:53 🔗 kristian_ has joined #archiveteam-bs
01:09 🔗 amiiboh has joined #archiveteam-bs
01:49 🔗 bwn yipdw: no, i think the offset and length you're seeing are only related to the s3 stuff, they're not exposed in the cli. It could probably be made to do so without too much trouble
02:01 🔗 ndiddy has joined #archiveteam-bs
02:01 🔗 ndiddy has left
02:10 🔗 Roelandus has quit IRC (Ping timeout: 268 seconds)
02:21 🔗 j08nY has quit IRC (Quit: Leaving)
02:37 🔗 pizzaiol1 has quit IRC (Remote host closed the connection)
03:36 🔗 VADemon has quit IRC (Quit: left4dead)
03:58 🔗 Aranje has quit IRC (Quit: Three sheets to the wind)
04:00 🔗 kyounko has joined #archiveteam-bs
04:37 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
04:37 🔗 BlueMaxim has joined #archiveteam-bs
04:53 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
04:53 🔗 BlueMaxim has joined #archiveteam-bs
05:46 🔗 zhongfu_ has joined #archiveteam-bs
05:46 🔗 zhongfu has quit IRC (Ping timeout: 260 seconds)
05:55 🔗 Sk1d has quit IRC (Ping timeout: 250 seconds)
06:02 🔗 Sk1d has joined #archiveteam-bs
06:22 🔗 Stiletto has quit IRC (Ping timeout: 244 seconds)
06:30 🔗 kyounko has quit IRC (KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
06:35 🔗 midas1 has joined #archiveteam-bs
06:50 🔗 Stilett0- has joined #archiveteam-bs
07:21 🔗 kristian_ has quit IRC (Quit: Leaving)
07:31 🔗 schbirid has joined #archiveteam-bs
07:47 🔗 Honno has joined #archiveteam-bs
07:58 🔗 odemg has quit IRC (Remote host closed the connection)
08:36 🔗 GE has joined #archiveteam-bs
09:21 🔗 j08nY has joined #archiveteam-bs
09:49 🔗 mls has joined #archiveteam-bs
09:55 🔗 mls has quit IRC (Quit: leaving)
10:22 🔗 GE has quit IRC (Remote host closed the connection)
10:56 🔗 Honno has quit IRC (Ping timeout: 370 seconds)
11:08 🔗 BlueMaxim has quit IRC (Quit: Leaving)
11:38 🔗 JAA has joined #archiveteam-bs
11:43 🔗 JAA Hi guys. As you probably know already, DMOZ is shutting down tomorrow. MasterX24 is running a grab (but doesn't seem to be here right now; his last status update from Thursday was "1M urls done, 780k left"), and I found two items on the Internet Archive which look like ArchiveBot grabs.
11:43 🔗 JAA I had a closer look at the second of those latter grabs (https://archive.org/details/falconk_archivebot_www_dmoz_org_20170302), and I discovered that the archive isn't really complete: 12.8% of the archived pages (23268 of 181366) are useless status 420 "Please see our terms of use" error pages.
11:44 🔗 JAA I'm aware that the most crucial data resides in the RDF files, which appear to be safely backed up (at https://archive.org/details/falconk_archivebot_rdf_dmoz_org_20170228), but I believe that doesn't cover everything and also isn't as user-friendly (clicking through pages on the Wayback Machine vs. having to find some way to parse RDF files etc.).
11:44 🔗 JAA So, what can be done about this?
11:45 🔗 SketchCow We can take another shot
11:46 🔗 JAA Would that be fast enough though? As said, they'll shut down tomorrow, and it looks like the previous grabs took several days.
11:48 🔗 JAA I guess it should be possible to extract the 420 pages from the CDX and just archive those again, but I'm not sure how to do that (in particular, the latter part; haven't worked with wget/WARC directly yet)
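Note: the CDX filtering JAA describes here could look roughly like the sketch below. It assumes a space-separated CDX file whose header line starts with " CDX" and names the columns by letter ("a" for the original URL, "s" for the status code). The function name and the sample rows are made up for illustration and are not taken from the actual grab.

```python
def urls_with_status(cdx_lines, status="420"):
    """Return the original URLs of all CDX records with the given status.

    Assumes a space-separated CDX file with a header line such as
    ' CDX N b a m s k r M S V g' ('a' = original URL, 's' = status code).
    """
    url_col = status_col = None
    seen = set()
    urls = []
    for line in cdx_lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "CDX":
            # Header line: work out which columns hold the URL and status.
            fields = parts[1:]
            url_col = fields.index("a")
            status_col = fields.index("s")
            continue
        if url_col is None or len(parts) <= max(url_col, status_col):
            # No header seen yet, or a malformed record: skip it.
            continue
        if parts[status_col] == status and parts[url_col] not in seen:
            seen.add(parts[url_col])
            urls.append(parts[url_col])
    return urls


# Hypothetical sample records, for illustration only.
SAMPLE = [
    " CDX N b a m s k r M S V g",
    "org,dmoz)/ 20170302000000 http://www.dmoz.org/ text/html 200 AAAA - - 1234 0 example.warc.gz",
    "org,dmoz)/public/flag 20170302000001 http://www.dmoz.org/public/flag text/html 420 BBBB - - 567 1234 example.warc.gz",
]

for url in urls_with_status(SAMPLE):
    print(url)
```

The resulting list could be written to a file, one URL per line, and used as the input list for a new grab.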
12:01 🔗 j08nY has quit IRC (Read error: Operation timed out)
12:01 🔗 GE has joined #archiveteam-bs
12:10 🔗 odemg has joined #archiveteam-bs
12:50 🔗 pizzaiolo has joined #archiveteam-bs
14:32 🔗 j08nY has joined #archiveteam-bs
14:44 🔗 odemg https://www.reddit.com/r/DataHoarder/comments/5z2499/many_of_the_uc_berkeley_youtube_videos_still_need
14:45 🔗 odemg has quit IRC (Remote host closed the connection)
14:51 🔗 Roelandus has joined #archiveteam-bs
14:52 🔗 Roelandus Has anyone found a solution to opening large warc files?
14:56 🔗 JAA Have you tried pywb? I don't know how large your files are and how well it handles those, but that's what I've been using so far.
15:05 🔗 JAA So, regarding DMOZ: I looked at the 420 pages in more detail, and those are almost exclusively pages banned by robots.txt. I guess DMOZ has some server-side detection of crawlers and blocks the access.
15:05 🔗 JAA There are a handful of other pages in there, plus all the editors' profile pages (which should actually be crawlable according to the robots.txt).
15:09 🔗 Roelandus I'm trying to open the hyves database but the file is too large
15:10 🔗 Roelandus has quit IRC (Quit: Page closed)
15:11 🔗 Jonison has joined #archiveteam-bs
15:11 🔗 JAA After filtering out the /public/{abuse,apply,flag,sendemail,suggest} pages, that's just 296 pages. So it's not as bad as I thought before.
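Note: a minimal sketch of the filtering step JAA mentions, assuming the 420 URLs are available as plain strings. The helper name and the example URLs are hypothetical; only the five /public/ path names come from the message above.

```python
from urllib.parse import urlsplit

# The /public/ sections named in the message above.
EXCLUDED_SECTIONS = {"abuse", "apply", "flag", "sendemail", "suggest"}


def keep_url(url):
    """Return False for URLs under /public/{abuse,apply,flag,sendemail,suggest}."""
    # "/public/flag/x" splits into ["", "public", "flag", "x"].
    parts = urlsplit(url).path.split("/")
    is_excluded = (
        len(parts) >= 3
        and parts[1] == "public"
        and parts[2] in EXCLUDED_SECTIONS
    )
    return not is_excluded


# Hypothetical example URLs, for illustration only.
examples = [
    "http://www.dmoz.org/public/suggest?cat=Arts",
    "http://www.dmoz.org/public/profile?editor=example",
    "http://www.dmoz.org/Arts/",
]
remaining = [u for u in examples if keep_url(u)]
print(len(remaining), "of", len(examples), "URLs kept")
```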
15:11 🔗 Roelandus has joined #archiveteam-bs
15:13 🔗 JAA Roelandus: how large is that file?
15:14 🔗 Roelandus like 50GB compressed
15:14 🔗 Roelandus 100GB uncompressed
15:15 🔗 JAA Hmm, yeah, that's quite a bit larger than what I've worked with.
15:16 🔗 JAA Still, I'd try out pywb. If it doesn't like the file, maybe split it with something like warcat.
15:17 🔗 Roelandus I tried splitting but the problem was that when I opened a split file it didn't show me a webpage
15:17 🔗 Roelandus It showed me just nothing
15:21 🔗 odemg has joined #archiveteam-bs
15:22 🔗 odemg has quit IRC (Remote host closed the connection)
15:26 🔗 j08nY has quit IRC (Ping timeout: 633 seconds)
15:31 🔗 j08nY has joined #archiveteam-bs
15:46 🔗 odemg has joined #archiveteam-bs
15:50 🔗 tephra_ has quit IRC (Ping timeout: 260 seconds)
15:50 🔗 tephra has joined #archiveteam-bs
16:24 🔗 odemg has quit IRC (Remote host closed the connection)
16:40 🔗 odemg has joined #archiveteam-bs
16:44 🔗 Honno has joined #archiveteam-bs
16:45 🔗 tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
16:58 🔗 Roelandus has quit IRC (Ping timeout: 268 seconds)
17:11 🔗 Roelandus has joined #archiveteam-bs
17:11 🔗 Roelandus I don't understand pywb
17:16 🔗 odemg has quit IRC (Remote host closed the connection)
17:27 🔗 JAA https://pypi.python.org/pypi/pywb describes it pretty well I think. I have to mention that the "Using Existing Web Archive Collections" part didn't work for me last time I tried. Adding the WARCs directly is fine though, but it'll take quite some time (since it re-indexes the WARC).
17:32 🔗 pizzaiolo has quit IRC (Ping timeout: 245 seconds)
17:38 🔗 Roelandus it sucks there aren't any yt tutorials available on warcs
17:40 🔗 Roelandus what cmds do I have to put in to ms powershell?
17:41 🔗 JAA Oh, Windows. No idea, haven't used it in many years.
17:54 🔗 arkiver this looks great https://github.com/webrecorder/warcio
19:09 🔗 schbirid https://sandstorm.io/news/2017-03-13-joining-cloudflare :|
19:18 🔗 Stilett0- has quit IRC ()
19:23 🔗 Jonison has quit IRC (Quit: Leaving)
19:24 🔗 odemg has joined #archiveteam-bs
19:29 🔗 ndiddy has joined #archiveteam-bs
19:40 🔗 Roelandus Are you guys all on linux? if so, on what OS?
19:42 🔗 JAA_ has joined #archiveteam-bs
19:42 🔗 JAA_ I'm using Debian
19:44 🔗 MrRadar Same (though my primary desktop is still Windows 7. Planning to switch over to desktop Linux once 7 goes out of support.)
19:45 🔗 JAA has quit IRC (Ping timeout: 268 seconds)
19:45 🔗 JAA_ is now known as JAA
19:48 🔗 Frogging start phasing it in now so the switch isn't jarring :)
19:48 🔗 Frogging use programs with linux versions, etc
19:49 🔗 MrRadar Yep. My main browser is Firefox, main text editor is Vim, using Open Office on the rare occasions I need to do office-type work
19:49 🔗 MrRadar All my programming projects are done on my Debian server
19:50 🔗 JAA s/OpenOffice/LibreOffice/g
19:50 🔗 MrRadar Right.
19:50 🔗 GE has quit IRC (Remote host closed the connection)
19:54 🔗 Honno has quit IRC (Ping timeout: 370 seconds)
19:55 🔗 Roelandus Well, microsoft's background services are bs. But if you want to play games like Overwatch you pretty much have to do that on windows.
20:05 🔗 odemg has quit IRC (Remote host closed the connection)
20:18 🔗 Frogging I'm glad all the games I play run on Linux, since they're mostly indie management sims and such.
20:18 🔗 Frogging RimWorld <3
20:25 🔗 Stilett0- has joined #archiveteam-bs
20:26 🔗 Stilett0- is now known as Stilett0
21:12 🔗 Stilett0 has quit IRC ()
21:16 🔗 GE has joined #archiveteam-bs
21:35 🔗 BlueMaxim has joined #archiveteam-bs
22:12 🔗 Honno has joined #archiveteam-bs
22:14 🔗 JAA First time using WARCing wget. Does this look like a decent command? Any recommendations? wget --user-agent ArchiveTeam --output-file ./wget.log --output-document ./wget.tmp -e robots=off --page-requisites --timeout 30 --tries inf --waitretry 30 --warc-file ./grab --input-file ./links
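Note: the command above, restated as an argument list with a comment per option, in case the flags are unfamiliar. This is only a sketch that rebuilds the same command line; the function name and its two parameters are made up for illustration, and nothing here runs wget.

```python
import shlex


def build_wget_args(links_file="./links", warc_prefix="./grab"):
    """Build the argument list for the wget invocation quoted above."""
    return [
        "wget",
        "--user-agent", "ArchiveTeam",     # identify the client to the server
        "--output-file", "./wget.log",     # write wget's log messages to a file
        "--output-document", "./wget.tmp", # write all downloaded bodies to one file
        "-e", "robots=off",                # do not honour robots.txt
        "--page-requisites",               # also fetch images, stylesheets, etc.
        "--timeout", "30",                 # network timeout in seconds
        "--tries", "inf",                  # retry indefinitely
        "--waitretry", "30",               # wait up to 30 s between retries
        "--warc-file", warc_prefix,        # name prefix of the WARC output
        "--input-file", links_file,        # read the URLs to fetch from this file
    ]


# Print the command as a single shell-quoted line (shlex.join needs Python 3.8+).
print(shlex.join(build_wget_args()))
```

The list could be passed to subprocess.run() as-is, which avoids shell quoting issues if the file names ever contain spaces.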
22:15 🔗 REiN^ has quit IRC (Read error: Operation timed out)
22:21 🔗 JAA I noticed that this command repeatedly downloads resources (images, stylesheets, etc.) shared between the individual links. That inflates the WARC and isn't what a browser would usually be doing anyway. Is it possible to avoid that?
22:21 🔗 schbirid JAA: consider using wpull instead right away
22:22 🔗 GE has quit IRC (Remote host closed the connection)
22:22 🔗 schbirid you could use -nc maybe
22:30 🔗 rocode wpull is nice, grab-site is best if you are managing multiple grabs.
22:30 🔗 rocode (note that grab-site manages wpull instances)
22:33 🔗 schbirid has quit IRC (Read error: Operation timed out)
22:35 🔗 ZexaronS has joined #archiveteam-bs
22:36 🔗 REiN^ has joined #archiveteam-bs
22:37 🔗 JAA Thanks, looks interesting. Right now, I'm just trying to get the job done so I can go to bed, but I'll definitely have to look at those projects again.
22:38 🔗 JAA (I believe that I've had issues with installing Tornado at some point, and I don't really want to go down that rabbit hole right now.)
22:38 🔗 ZexaronS Just a headsup people http://news.berkeley.edu/2017/03/01/course-capture/
22:39 🔗 JAA ZexaronS: yup: http://archiveteam.org/index.php?title=UC_Berkeley_Course_Captures :-)
22:45 🔗 JAA By the way, regarding --no-clobber/-nc: "WARC output does not work with --no-clobber, --no-clobber will be disabled."
22:46 🔗 bwn has quit IRC (Read error: Operation timed out)
22:48 🔗 ZexaronS Sorry, I'm late, the news site I looked at is like 2 weeks late too
22:54 🔗 bwn has joined #archiveteam-bs
22:57 🔗 JAA Uh oh, wget eats all my memory until it's killed by the kernel...
23:00 🔗 JAA I just noticed that Tornado is also a dependency of seesaw and thus already installed on my machine. Never mind that earlier comment. I'll try wpull now.
23:01 🔗 Stilett0 has joined #archiveteam-bs
23:01 🔗 Stilett0 is now known as Stiletto
23:06 🔗 bwn has quit IRC (Ping timeout: 244 seconds)
23:07 🔗 JAA Yeah, that's not working too well either. Hitting https://github.com/chfoo/wpull/issues/349 among others
23:14 🔗 bwn has joined #archiveteam-bs
23:16 🔗 JAA Looks like the wget-lua build with --truncate-output works. Memory usage is still increasing with time, but not as drastically.
23:26 🔗 JAA ... but that doesn't download the same files as without that option.
23:37 🔗 JAA Alright, I give up. Here's the list of the pages in ArchiveBot's DMOZ grab which are an error 420 page (cf. my messages around 11:45 UTC) if anyone with more experience than me wants to take a shot: https://ghostbin.com/paste/2ejhg
23:41 🔗 BlueMaxim has quit IRC (Read error: Operation timed out)
