#archiveteam-bs 2017-03-13,Mon

Time	Nickname	Message
00:01 ^🔗		Stilett0- has quit IRC (Read error: Operation timed out)
00:25 ^🔗		GE has quit IRC (Quit: zzz)
00:53 ^🔗		kristian_ has joined #archiveteam-bs
01:09 ^🔗		amiiboh has joined #archiveteam-bs
01:49 ^🔗	bwn	yipdw: no, i think the offset and length you're seeing are only related to the s3 stuff, they're not exposed in the cli. It could probably be made to do so without too much trouble
02:01 ^🔗		ndiddy has joined #archiveteam-bs
02:01 ^🔗		ndiddy has left
02:10 ^🔗		Roelandus has quit IRC (Ping timeout: 268 seconds)
02:21 ^🔗		j08nY has quit IRC (Quit: Leaving)
02:37 ^🔗		pizzaiol1 has quit IRC (Remote host closed the connection)
03:36 ^🔗		VADemon has quit IRC (Quit: left4dead)
03:58 ^🔗		Aranje has quit IRC (Quit: Three sheets to the wind)
04:00 ^🔗		kyounko has joined #archiveteam-bs
04:37 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
04:37 ^🔗		BlueMaxim has joined #archiveteam-bs
04:53 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)
04:53 ^🔗		BlueMaxim has joined #archiveteam-bs
05:46 ^🔗		zhongfu_ has joined #archiveteam-bs
05:46 ^🔗		zhongfu has quit IRC (Ping timeout: 260 seconds)
05:55 ^🔗		Sk1d has quit IRC (Ping timeout: 250 seconds)
06:02 ^🔗		Sk1d has joined #archiveteam-bs
06:22 ^🔗		Stiletto has quit IRC (Ping timeout: 244 seconds)
06:30 ^🔗		kyounko has quit IRC (KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
06:35 ^🔗		midas1 has joined #archiveteam-bs
06:50 ^🔗		Stilett0- has joined #archiveteam-bs
07:21 ^🔗		kristian_ has quit IRC (Quit: Leaving)
07:31 ^🔗		schbirid has joined #archiveteam-bs
07:47 ^🔗		Honno has joined #archiveteam-bs
07:58 ^🔗		odemg has quit IRC (Remote host closed the connection)
08:36 ^🔗		GE has joined #archiveteam-bs
09:21 ^🔗		j08nY has joined #archiveteam-bs
09:49 ^🔗		mls has joined #archiveteam-bs
09:55 ^🔗		mls has quit IRC (Quit: leaving)
10:22 ^🔗		GE has quit IRC (Remote host closed the connection)
10:56 ^🔗		Honno has quit IRC (Ping timeout: 370 seconds)
11:08 ^🔗		BlueMaxim has quit IRC (Quit: Leaving)
11:38 ^🔗		JAA has joined #archiveteam-bs
11:43 ^🔗	JAA	Hi guys. As you probably know already, DMOZ is shutting down tomorrow. MasterX24 is running a grab (but doesn't seem to be here right now; his last status update from Thursday was "1M urls done, 780k left"), and I found two items on the Internet Archive which look like ArchiveBot grabs.
11:43 ^🔗	JAA	I had a closer look at the second of those latter grabs (https://archive.org/details/falconk_archivebot_www_dmoz_org_20170302), and I discovered that the archive isn't really complete: 12.8% of the archived pages (23268 of 181366) are useless status 420 "Please see our terms of use" error pages.
11:44 ^🔗	JAA	I'm aware that the most crucial data resides in the RDF files, which appear to be safely backed up (at https://archive.org/details/falconk_archivebot_rdf_dmoz_org_20170228), but I believe that doesn't cover everything and also isn't as user-friendly (clicking through pages on the Wayback Machine vs. having to find some way to parse RDF files etc.).
11:44 ^🔗	JAA	So, what can be done about this?
11:45 ^🔗	SketchCow	We can take another shot
11:46 ^🔗	JAA	Would that be fast enough though? As said, they'll shut down tomorrow, and it looks like the previous grabs took several days.
11:48 ^🔗	JAA	I guess it should be possible to extract the 420 pages from the CDX and just archive those again, but I'm not sure how to do that (in particular, the latter part; haven't worked with wget/WARC directly yet)
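A minimal sketch of the CDX-filtering step JAA describes, assuming the common space-delimited CDX layout in which a header line names the columns by letter (`a` for the original URL, `s` for the response code). The function name `find_status_urls` and the sample rows are made up for illustration and are not taken from the actual DMOZ grab.

```python
def find_status_urls(cdx_lines, status="420"):
    """Return the unique original URLs whose CDX status column equals `status`."""
    fields = None
    url_idx = status_idx = None
    seen = set()
    urls = []
    for raw in cdx_lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.lstrip().startswith("CDX "):
            # Header such as " CDX N b a m s k r M S V g": letters name the columns.
            fields = line.split()[1:]
            url_idx = fields.index("a")     # original URL
            status_idx = fields.index("s")  # HTTP response code
            continue
        if fields is None:
            raise ValueError("expected a CDX header line before any records")
        parts = line.split(" ")
        if len(parts) != len(fields):
            continue  # skip rows that do not match the header layout
        url = parts[url_idx]
        if parts[status_idx] == status and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical sample rows (invented digests, sizes, and file name).
SAMPLE_CDX = [
    " CDX N b a m s k r M S V g",
    "org,dmoz)/ 20170302000000 http://www.dmoz.org/ text/html 200 AAAA - - 1234 0 example.warc.gz",
    "org,dmoz)/public/suggest 20170302000001 http://www.dmoz.org/public/suggest text/html 420 BBBB - - 500 1234 example.warc.gz",
]

if __name__ == "__main__":
    for url in find_status_urls(SAMPLE_CDX):
        print(url)
```

The resulting list could then be written to a file, one URL per line, and fed to a crawler as its input list.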
12:01 ^🔗		j08nY has quit IRC (Read error: Operation timed out)
12:01 ^🔗		GE has joined #archiveteam-bs
12:10 ^🔗		odemg has joined #archiveteam-bs
12:50 ^🔗		pizzaiolo has joined #archiveteam-bs
14:32 ^🔗		j08nY has joined #archiveteam-bs
14:44 ^🔗	odemg	https://www.reddit.com/r/DataHoarder/comments/5z2499/many_of_the_uc_berkeley_youtube_videos_still_need
14:45 ^🔗		odemg has quit IRC (Remote host closed the connection)
14:51 ^🔗		Roelandus has joined #archiveteam-bs
14:52 ^🔗	Roelandus	Has anyone found a solution to opening large warc files?
14:56 ^🔗	JAA	Have you tried pywb? I don't know how large your files are and how well it handles those, but that's what I've been using so far.
15:05 ^🔗	JAA	So, regarding DMOZ: I looked at the 420 pages in more detail, and those are almost exclusively pages banned by robots.txt. I guess DMOZ has some server-side detection of crawlers and blocks the access.
15:05 ^🔗	JAA	There are a handful of other pages in there, plus all the editors' profile pages (which should actually be crawlable according to the robots.txt).
15:09 ^🔗	Roelandus	I'm trying to open the hyves database but the file is too large
15:10 ^🔗		Roelandus has quit IRC (Quit: Page closed)
15:11 ^🔗		Jonison has joined #archiveteam-bs
15:11 ^🔗	JAA	After filtering out the /public/{abuse,apply,flag,sendemail,suggest} pages, that's just 296 pages. So it's not as bad as I thought before.
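The filtering JAA mentions could look roughly like the sketch below. How JAA actually did it is not stated in the log; the helper names `is_excluded` and `remaining_pages` and the example URLs are assumptions for illustration only.

```python
from urllib.parse import urlsplit

# The five /public/ sections named in the message above.
EXCLUDED_SECTIONS = {"abuse", "apply", "flag", "sendemail", "suggest"}


def is_excluded(url):
    """True if the URL path starts with /public/<one of the excluded sections>."""
    # "/public/suggest/x" splits into ["", "public", "suggest", "x"];
    # the query string is not part of .path, so "?cat=..." is ignored.
    parts = urlsplit(url).path.split("/")
    return len(parts) >= 3 and parts[1] == "public" and parts[2] in EXCLUDED_SECTIONS


def remaining_pages(urls):
    """Keep only the URLs that are not under an excluded /public/ section."""
    return [u for u in urls if not is_excluded(u)]


if __name__ == "__main__":
    # Hypothetical URLs, not taken from the real list.
    example = [
        "http://www.dmoz.org/public/suggest?cat=Example",
        "http://www.dmoz.org/some/other/page",
    ]
    print(remaining_pages(example))
```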
15:11 ^🔗		Roelandus has joined #archiveteam-bs
15:13 ^🔗	JAA	Roelandus: how large is that file?
15:14 ^🔗	Roelandus	like 50GB compressed
15:14 ^🔗	Roelandus	100GB uncompressed
15:15 ^🔗	JAA	Hmm, yeah, that's quite a bit larger than what I've worked with.
15:16 ^🔗	JAA	Still, I'd try out pywb. If it doesn't like the file, maybe split it with something like warcat.
15:17 ^🔗	Roelandus	I tried splitting, but the problem was that when I opened a split file it didn't show me a webpage
15:17 ^🔗	Roelandus	It showed me just nothing
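For a file of the size discussed above, one option is to stream through it record by record rather than opening it whole. The sketch below uses only the Python standard library and assumes a gzip-compressed WARC in which each record starts with a `WARC/` version line, followed by headers, a blank line, and `Content-Length` bytes of payload. The function names are hypothetical, and a dedicated tool such as pywb would be far more robust; this only lists which URLs a file contains.

```python
import gzip


def iter_warc_headers(stream):
    """Yield a dict of lower-cased WARC headers per record, skipping payloads."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Consume the payload in 1 MiB chunks so memory use stays flat.
        remaining = int(headers.get("content-length", 0))
        while remaining > 0:
            chunk = stream.read(min(remaining, 1 << 20))
            if not chunk:
                break
            remaining -= len(chunk)
        yield headers


def list_target_uris(path):
    """Print the WARC-Target-URI of every record in a .warc.gz file."""
    # gzip.open reads concatenated gzip members transparently.
    with gzip.open(path, "rb") as stream:
        for headers in iter_warc_headers(stream):
            uri = headers.get("warc-target-uri")
            if uri:
                print(uri)
```

Calling `list_target_uris("some-file.warc.gz")` (the file name is a placeholder) would print one URL per record without holding more than a chunk of the file in memory at a time.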
15:21 ^🔗		odemg has joined #archiveteam-bs
15:22 ^🔗		odemg has quit IRC (Remote host closed the connection)
15:26 ^🔗		j08nY has quit IRC (Ping timeout: 633 seconds)
15:31 ^🔗		j08nY has joined #archiveteam-bs
15:46 ^🔗		odemg has joined #archiveteam-bs
15:50 ^🔗		tephra_ has quit IRC (Ping timeout: 260 seconds)
15:50 ^🔗		tephra has joined #archiveteam-bs
16:24 ^🔗		odemg has quit IRC (Remote host closed the connection)
16:40 ^🔗		odemg has joined #archiveteam-bs
16:44 ^🔗		Honno has joined #archiveteam-bs
16:45 ^🔗		tfgbd_znc has quit IRC (Ping timeout: 600 seconds)
16:58 ^🔗		Roelandus has quit IRC (Ping timeout: 268 seconds)
17:11 ^🔗		Roelandus has joined #archiveteam-bs
17:11 ^🔗	Roelandus	I don't understand pywb
17:16 ^🔗		odemg has quit IRC (Remote host closed the connection)
17:27 ^🔗	JAA	https://pypi.python.org/pypi/pywb describes it pretty well I think. I have to mention that the "Using Existing Web Archive Collections" part didn't work for me last time I tried. Adding the WARCs directly is fine though, but it'll take quite some time (since it re-indexes the WARC).
17:32 ^🔗		pizzaiolo has quit IRC (Ping timeout: 245 seconds)
17:38 ^🔗	Roelandus	it sucks there aren't any yt tutorials available on warcs
17:40 ^🔗	Roelandus	what cmds do I have to put in to ms powershell?
17:41 ^🔗	JAA	Oh, Windows. No idea, haven't used it in many years.
17:54 ^🔗	arkiver	this looks great https://github.com/webrecorder/warcio
19:09 ^🔗	schbirid	https://sandstorm.io/news/2017-03-13-joining-cloudflare :|
19:18 ^🔗		Stilett0- has quit IRC ()
19:23 ^🔗		Jonison has quit IRC (Quit: Leaving)
19:24 ^🔗		odemg has joined #archiveteam-bs
19:29 ^🔗		ndiddy has joined #archiveteam-bs
19:40 ^🔗	Roelandus	Are you guys all on linux? if so, on what OS?
19:42 ^🔗		JAA_ has joined #archiveteam-bs
19:42 ^🔗	JAA_	I'm using Debian
19:44 ^🔗	MrRadar	Same (though my primary desktop is still Windows 7. Planning to switch over to desktop Linux once 7 goes out of support.)
19:45 ^🔗		JAA has quit IRC (Ping timeout: 268 seconds)
19:45 ^🔗		JAA_ is now known as JAA
19:48 ^🔗	Frogging	start phasing it in now so the switch isn't jarring :)
19:48 ^🔗	Frogging	use programs with linux versions, etc
19:49 ^🔗	MrRadar	Yep. My main browser is Firefox, main text editor is Vim, using Open Office on the rare occasions I need to do office-type work
19:49 ^🔗	MrRadar	All my programming projects are done on my Debian server
19:50 ^🔗	JAA	s/OpenOffice/LibreOffice/g
19:50 ^🔗	MrRadar	Right.
19:50 ^🔗		GE has quit IRC (Remote host closed the connection)
19:54 ^🔗		Honno has quit IRC (Ping timeout: 370 seconds)
19:55 ^🔗	Roelandus	Well, microsoft's background services are bs. But if you want to play games like Overwatch you pretty much have to do that on windows.
20:05 ^🔗		odemg has quit IRC (Remote host closed the connection)
20:18 ^🔗	Frogging	I'm glad all the games I play run on Linux, since they're mostly indie management sims and such.
20:18 ^🔗	Frogging	RimWorld <3
20:25 ^🔗		Stilett0- has joined #archiveteam-bs
20:26 ^🔗		Stilett0- is now known as Stilett0
21:12 ^🔗		Stilett0 has quit IRC ()
21:16 ^🔗		GE has joined #archiveteam-bs
21:35 ^🔗		BlueMaxim has joined #archiveteam-bs
22:12 ^🔗		Honno has joined #archiveteam-bs
22:14 ^🔗	JAA	First time using WARCing wget. Does this look like a decent command? Any recommendations? wget --user-agent ArchiveTeam --output-file ./wget.log --output-document ./wget.tmp -e robots=off --page-requisites --timeout 30 --tries inf --waitretry 30 --warc-file ./grab --input-file ./links
22:15 ^🔗		REiN^ has quit IRC (Read error: Operation timed out)
22:21 ^🔗	JAA	I noticed that this command repeatedly downloads resources (images, stylesheets, etc.) shared between the individual links. That inflates the WARC and isn't what a browser would usually be doing anyway. Is it possible to avoid that?
22:21 ^🔗	schbirid	JAA: consider using wpull instead right away
22:22 ^🔗		GE has quit IRC (Remote host closed the connection)
22:22 ^🔗	schbirid	you could use -nc maybe
22:30 ^🔗	rocode	wpull is nice, grab-site is best if you are managing multiple grabs.
22:30 ^🔗	rocode	(note that grab-site manages wpull instances)
22:33 ^🔗		schbirid has quit IRC (Read error: Operation timed out)
22:35 ^🔗		ZexaronS has joined #archiveteam-bs
22:36 ^🔗		REiN^ has joined #archiveteam-bs
22:37 ^🔗	JAA	Thanks, looks interesting. Right now, I'm just trying to get the job done so I can go to bed, but I'll definitely have to look at those projects again.
22:38 ^🔗	JAA	(I believe that I've had issues with installing Tornado at some point, and I don't really want to go down that rabbit hole right now.)
22:38 ^🔗	ZexaronS	Just a headsup people http://news.berkeley.edu/2017/03/01/course-capture/
22:39 ^🔗	JAA	ZexaronS: yup: http://archiveteam.org/index.php?title=UC_Berkeley_Course_Captures :-)
22:45 ^🔗	JAA	By the way, regarding --no-clobber/-nc: "WARC output does not work with --no-clobber, --no-clobber will be disabled."
22:46 ^🔗		bwn has quit IRC (Read error: Operation timed out)
22:48 ^🔗	ZexaronS	Sorry, I'm late, the news site I looked at is like 2 weeks late too
22:54 ^🔗		bwn has joined #archiveteam-bs
22:57 ^🔗	JAA	Uh oh, wget eats all my memory until it's killed by the kernel...
23:00 ^🔗	JAA	I just noticed that Tornado is also a dependency of seesaw and thus already installed on my machine. Never mind that earlier comment. I'll try wpull now.
23:01 ^🔗		Stilett0 has joined #archiveteam-bs
23:01 ^🔗		Stilett0 is now known as Stiletto
23:06 ^🔗		bwn has quit IRC (Ping timeout: 244 seconds)
23:07 ^🔗	JAA	Yeah, that's not working too well either. Hitting https://github.com/chfoo/wpull/issues/349 among others
23:14 ^🔗		bwn has joined #archiveteam-bs
23:16 ^🔗	JAA	Looks like the wget-lua build with --truncate-output works. Memory usage is still increasing with time, but not as drastically.
23:26 ^🔗	JAA	... but that doesn't download the same files as without that option.
23:37 ^🔗	JAA	Alright, I give up. Here's the list of the pages in ArchiveBot's DMOZ grab which are an error 420 page (cf. my messages around 11:45 UTC) if anyone with more experience than me wants to take a shot: https://ghostbin.com/paste/2ejhg
23:41 ^🔗		BlueMaxim has quit IRC (Read error: Operation timed out)