Time | Nickname | Message
03:42 | kyan | Once I have a WARC/megawarc, how best to extract outbound URLs from it?
03:43 | kyan | (and internal links)
03:52 | dashcloud | is there anything in the .cdx file of use for you?
03:52 | ivan` | use hanzo warc-tools to get the response bodies, then parse them with html5lib/lxml/beautifulsoup/whatever the pythonistas are using now
03:54 | ivan` | or if you want something super-terrible to extract <a href="blah where the full target is on the same line, you can use zgrep -o on the .warc.gz
03:55 | ivan` | (will not actually work across chunk boundaries, don't rely on it)
03:56 | kyan | (that would get resources & some javascript links, which would be a plus)
03:57 | ivan` | I don't know about it
03:59 | kyan | Thanks for the warc-tools pointer, that's definitely handy :)
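The pipeline ivan` describes, pulling response bodies out of the WARC and then parsing them, can be sketched with the Python stdlib alone; html5lib/lxml/BeautifulSoup are more forgiving on messy real-world markup. The HTML body below is a hypothetical stand-in for a WARC response payload:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect link targets: <a href=...> plus resource URLs (src attributes)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        for name, value in attrs:
            if value and name in ("href", "src"):
                self.links.append(value)

# Hypothetical HTML body, standing in for a decoded WARC response payload.
body = '<a href="http://example.com/out">x</a><img src="/logo.png">'
parser = LinkExtractor()
parser.feed(body)
print(parser.links)  # → ['http://example.com/out', '/logo.png']
```

Collecting `src` as well as `href` also picks up the resource links kyan mentions; the zgrep trick, by contrast, only catches targets that happen to sit entirely on one line.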
04:04 | dashcloud | there's a nice wiki page here: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem on various tools for WARCs
04:14 | kyan | dashcloud, thanks, sweet :)
04:14 | kyan | this is very useful
09:21 | * | midas1 stabs magento in the face
09:25 | BlueMax | ...there's no magneto here
09:36 | midas1 | there is on my servers
09:36 | midas1 | and i prefer to stab it in the face, that way it sees who is stabbing it
11:02 | joepie91 | I'm thinking of improving my pastebin scraper so that A. it will live-crawl multiple pastebins and B. it will offer websockets/0mq streams of pastes as they are crawled
11:02 | joepie91 | like, a realtime feed of pastes
11:02 | joepie91 | shouldn't be too hard
11:03 | joepie91 | and I'd imagine you can do a lot of fun things with that :)
11:09 | Smiley | could be fun
11:34 | SketchCow | The Hilton did a $600 charge against my card. Not cool, Hilton
11:39 | joepie91 | SketchCow: :(
11:55 | SketchCow | Bad, bad Hilton
12:30 | midas1 | did you empty the minibar?
12:31 | midas1 | if not, do it anyway
12:41 | joepie91 | lol
12:41 | joepie91 | "I paid for it - now I'll make sure that I make use of it"
13:16 | midas1 | indeed joepie91
13:16 | midas1 | "fuck this, im getting drunk @ 600 dollar"
15:14 | GLaDOS | http://scr.terrywri.st/1385996531.png i have no idea what im doing
15:17 | midas1 | wot wot GLaDOS
15:18 | GLaDOS | the faces, oh god the faces
15:19 | BiggieJon | glad that was tiny, cuz I'm guessing it was NSFW :)
15:19 | GLaDOS | http://scr.terrywri.st/yolo.png here you go, full size at 10mb
15:20 | BiggieJon | they all look sooo happy :)
15:21 | midas1 | 10MB...wtf
15:21 | midas1 | MY LORD THE FACES!
15:21 | GLaDOS | and this is why you never let me down a pepsi when im sleep deprived
15:21 | GLaDOS | faceswap ALL the people.
15:22 | midas1 | i'm no person that believes in god, but hell, this is horrible, i would almost start to pray if i knew how
15:22 | GLaDOS | lets play this fun game called spot the original!
15:23 | midas1 | the girl 4th from the right? :P
15:23 | GLaDOS | nope
15:23 | midas1 | 3rd from the right, second row?
15:24 | midas1 | was it a girl? :P
15:24 | GLaDOS | nope, and yep
15:25 | midas1 | ah yes! i got her
15:25 | midas1 | it was the 4th from the left
15:25 | GLaDOS | nope.
15:25 | midas1 | hahaha
15:25 | midas1 | the one on the right
15:25 | midas1 | last one
15:25 | GLaDOS | yeah, its her
15:26 | midas1 | i missed her, im viewing this picture on a potato of a screen
15:26 | GLaDOS | ah, that'd be why
15:27 | midas1 | it's a 19" screen with a res of 1024x768
15:27 | midas1 | so yeah, i can see 2 faces when it's full size
15:29 | midas1 | thank god for 100mbit@home
15:29 | GLaDOS | and now, to fix the sleep deprivation, i sleep.
15:29 | GLaDOS | o7
15:31 | midas1 | good night!
17:01 | arkiver | if a website is blocked from being archived in the robots.txt
17:01 | arkiver | is it then still downloaded by the Wayback Machine, but not shown?
17:01 | arkiver | or is it not downloaded at all
17:04 | balrog | it's not downloaded at all while it is blocked via robots.txt
17:05 | balrog | old versions are retained but not shown
17:06 | arkiver | hmm oke
17:06 | arkiver | I'm going to search for robots.txt blocked pages as well then
17:06 | arkiver | for archival for the IA
17:07 | arkiver | oke = ok*
18:03 | ivan` | arkiver: it would be interesting to grab robots.txt for every domain on the 'net and search for those that block IA or block all unknown bots
18:05 | DFJustin | someone here was downloading all robots.txt
18:07 | balrog | IA *does* grab robots.txt
18:11 | Schbirid | <@DFJustin> someone here was downloading all robots.txt
18:11 | Schbirid | you rang
18:12 | Schbirid | only top 10000 alexa sites
18:23 | DFJustin | can you easily filter for ones that block ia_archiver or *
18:25 | Schbirid | sure, let's see
18:25 | Schbirid | err, well, i cant
18:25 | Schbirid | only grep
18:25 | Schbirid | i never found a good parser so i never did anything with them
18:30 | Schbirid | i am running: grep -ER -A 1 "(ia_archiver|User-agent: \*)" *
18:31 | Schbirid | for "some" hits
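Short of the good parser Schbirid never found, the stdlib's `urllib.robotparser` can answer the actual question: does a given robots.txt block ia_archiver, either by name or via a `User-agent: *` rule? A sketch over saved robots.txt bodies (the sample rules are made up):

```python
from urllib import robotparser

def blocks_ia(robots_txt: str) -> bool:
    """True if these robots.txt rules forbid ia_archiver from fetching the site root."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return not rp.can_fetch("ia_archiver", "/")

# Hypothetical saved robots.txt bodies.
print(blocks_ia("User-agent: *\nDisallow: /"))            # → True (blocks all unknown bots)
print(blocks_ia("User-agent: ia_archiver\nDisallow: /"))  # → True (blocks IA by name)
print(blocks_ia("User-agent: *\nDisallow: /private/"))    # → False (root still allowed)
```

Unlike `grep -A 1`, this respects grouping of `User-agent` blocks and multiple `Disallow` lines per block.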
18:34 | Schbirid | https://pastee.org/88nu6
18:45 | DFJustin | 95 hyves.nl
20:01 | ivan` | I'm submitting a few of those to archivebot
20:02 | arkiver | so
20:02 | ivan` | if you see anything remotely interesting please do the same
20:02 | arkiver | do we have a list of blocked websites?
20:02 | arkiver | I got a few terabytes free here
20:02 | arkiver | so I can still download quite some websites
20:02 | arkiver | and then upload them
20:02 | arkiver | :)
20:02 | ivan` | not all of https://pastee.org/88nu6 are blocked but there's a lot
20:03 | ivan` | arkiver: do you have upstream?
20:03 | arkiver | ?
20:03 | arkiver | nope, what is it, upstream?
20:03 | ivan` | how fast can you upload?
20:03 | arkiver | well
20:03 | arkiver | let's see
20:03 | arkiver | download speed: 7 - 8 Megabyte per second
20:04 | arkiver | upload speed: 700 - 800 Kilobyte per second
20:04 | arkiver | so I think that should be ok
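Rough arithmetic on that "ok": at the quoted 700-800 KB/s upstream, even a single terabyte takes about two weeks to push, so "a few terabytes" is a months-long upload.

```python
# Rough transfer-time estimate at the quoted upstream rate.
upstream_bps = 750 * 1000     # ~750 KB/s, midpoint of 700-800 KB/s
terabyte = 10**12             # 1 TB in bytes (decimal)
seconds = terabyte / upstream_bps
days = seconds / 86400        # 86400 seconds per day
print(round(days, 1))         # → 15.4
```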
20:04 | arkiver | buuut
20:04 | arkiver | what's upstream?
20:05 | ivan` | "In computer networking, upstream refers to the direction in which data can be transferred from the client to the server (uploading)."
20:05 | arkiver | ah nope
20:05 | ersi | upstream == upload
20:06 | ivan` | how much RAM do you have? wondering if you could run an archivebot pipeline to do archivebot jobs
20:06 | arkiver | I am downloading everything with heritrix 3.1.1, then uploading it to the archive and then sending an email to jason to move the files to the wayback machine
20:06 | arkiver | my ram?
20:06 | arkiver | 4 gb right now
20:06 | arkiver | but
20:06 | arkiver | soon I'm going to buy a new computer, which will have around 16GB SDRAM
20:06 | arkiver | (like 70% sure I'm going to buy it)
20:06 | arkiver | also
20:07 | ivan` | you know you want 32GB ;)
20:07 | arkiver | I don't have my computer on 24/7
20:07 | ivan` | heh
20:07 | arkiver | you got 32??
20:07 | arkiver | O.o
20:07 | ivan` | I have 96GB in a box but my upstream is 160KB/s
20:07 | arkiver | ah
20:07 | * | ersi pokes the VM host machine with 256GB RAM
20:07 | arkiver | my ram is lower but upstream faster
20:07 | arkiver | :P
20:07 | arkiver | -.-
20:07 | arkiver | ok ok ok
20:07 | ivan` | why do you turn off your computer?
20:08 | ersi | I turn off most of my shit as well
20:08 | arkiver | I now know that my ram isn't high guys... -.-
20:08 | arkiver | yep
20:08 | arkiver | at night
20:08 | arkiver | it's in my room
20:08 | ersi | that machine isn't mine
20:08 | arkiver | and making noise
20:08 | ersi | it's a machine at work
20:08 | arkiver | and it's irritating then...
20:08 | arkiver | yeah
20:08 | arkiver | as ersi says
20:08 | ersi | My laptop got 8GB and my workstation got 4GB
20:08 | ivan` | do you have a closet? perfect place for a computer
20:08 | arkiver | -.-
20:08 | ersi | I do have a closet "server" machine though ^_^
20:09 | arkiver | not gonna place my pc in there
20:09 | arkiver | in my closet...
20:09 | arkiver | so oke
20:09 | arkiver | I'm going through that list right now and looking at the robots.txt
20:09 | arkiver | and then selecting the websites to download
20:10 | ivan` | closet blocks like 30dB
20:10 | arkiver | yeah well
20:10 | arkiver | nah
20:10 | arkiver | I'm happy like this
20:10 | arkiver | maybe some other time
20:10 | arkiver | so
20:10 | arkiver | which sites from the list are already downloaded?
20:10 | arkiver | or downloading
20:13 | arkiver | http://www.insideview.com/robots.txt
20:14 | arkiver | # bad crawlers
20:14 | arkiver | Disallow: /
20:14 | arkiver | User-agent: *
20:14 | arkiver | "bad crawlers" O.o :'(
20:14 | Schbirid | :D
20:15 | Schbirid | is there any value in keeping the daily 1m top sites zip from alexa? i want to clean up
20:15 | arkiver | I don't know
20:15 | arkiver | but did you create that list of websites?
20:15 | ersi | Schbirid: How large is the data?
20:15 | ersi | arkiver: alexa.com provides a list of 1m top sites
20:16 | arkiver | yes
20:16 | arkiver | but can we automatically check the robots.txt?
20:16 | arkiver | also
20:16 | arkiver | this site is also blocked:
20:16 | arkiver | http://svs.gsfc.nasa.gov/
20:16 | ersi | "Free download" from http://www.alexa.com/topsites -> http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
20:16 | ersi | Well, sure..
20:16 | arkiver | many GBs of great visualisation videos
20:16 | arkiver | just blocked... :(
20:17 | arkiver | if gone, everything is gone
20:17 | Schbirid | y
20:17 | Schbirid | ~10M per day
20:17 | Schbirid | i got 8G here
20:18 | ersi | ~10MB/day?
20:18 | arkiver | 8GB of robots.txt's?
20:18 | Schbirid | nah, 8G of the 1m file
20:19 | Schbirid | 3G of robots files :D
20:19 | arkiver | ah
20:19 | Schbirid | 365 7z files in one item sound idiotic or ok? i want to dump them to IA
20:19 | arkiver | and have they already been checked to see if IA is blocked?
20:19 | Schbirid | no
20:20 | Schbirid | dumb daily downloading
20:21 | Schbirid | https://github.com/ArchiveTeam/robots-relapse is some version, not sure what exactly that one does
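The daily snapshot being hoarded here is a zip containing a single CSV of `rank,domain` rows; the parse step of a robots-relapse-style loop might look like the sketch below. The zip is built in memory for illustration, standing in for a download of the top-1m.csv.zip linked above:

```python
import csv
import io
import zipfile

def top_domains(zip_bytes: bytes, limit: int = 10):
    """Yield domains from an Alexa-style top-1m.csv.zip (rows of rank,domain)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # The archive holds a single CSV member
        with zf.open(zf.namelist()[0]) as f:
            reader = csv.reader(io.TextIOWrapper(f, encoding="utf-8"))
            for rank, domain in reader:
                if int(rank) > limit:
                    break
                yield domain

# Build a tiny stand-in for the real top-1m.csv.zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("top-1m.csv", "1,example.com\n2,example.org\n")
print(list(top_domains(buf.getvalue())))  # → ['example.com', 'example.org']
```

Each yielded domain could then be fed to a robots.txt fetch and a blocked-or-not check.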
20:22 | arkiver | already downloaded 2.5 GB of www.webmonkey.com/
20:22 | arkiver | hmm
20:22 | arkiver | maybe it would be helpful to create a list of websites people from the Archiveteam are currently downloading at home?
20:22 | arkiver | maybe in the wiki?
20:22 | arkiver | and that we regularly update it?
20:26 | ivan` | better to just get everything to IA
20:27 | arkiver | I mean that we'd have a more organised list of what is done and what still needs to be done?
20:28 | arkiver | and that we take some websites and put our name behind them if we are working on them
20:28 | arkiver | know what I mean
20:28 | arkiver | ?
20:29 | ivan` | once you're grabbing many domains per day, I don't think you'll be motivated to keep it in sync
20:30 | arkiver | hmm oke then
20:30 | arkiver | but some domains will be grabbed twice or more maybe...
20:30 | arkiver | well
20:30 | ersi | so? :)
20:30 | arkiver | :P
20:30 | ersi | disk is cheap
20:30 | arkiver | and everything is going to IA
20:30 | arkiver | so not on our disk
20:46 | w0rp | Redundancy is good for archiving.
21:11 | arkiver | is anyone else here using heritrix?
21:11 | arkiver | I'm having a problem right now... :(
21:17 | ersi | I always recommend the following: Write about the problem instead of asking to ask about asking to ask
21:17 | ersi | I'm not running heritrix. What's the problem?
21:18 | godane | i think my bluray player may hate me
21:18 | godane | *bluray burner
21:18 | arkiver | I keep getting this error:
21:18 | arkiver | 2013-12-02T21:15:01.002Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.UnsatisfiedLinkError: Error looking up function 'link': Kan opgegeven procedure niet vinden.
21:18 | godane | i set it to burn at speed 4x
21:18 | arkiver | when I try to start a job from the checkpoint
21:18 | godane | and now its trying to burn at 10x
21:19 | ersi | What does the Dutch error message mean?
21:20 | arkiver | Can't find given procedure
21:20 | ersi | Hm. Has it worked before?
21:20 | arkiver | well
21:20 | arkiver | It suddenly worked one time
21:20 | arkiver | but then not
21:21 | arkiver | and before that time also not
21:21 | arkiver | if I try it once I get the error above, and if I then try a second time it gives me this error:
21:21 | arkiver | 2013-12-02T21:20:48.918Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.IllegalStateException: com.sleepycat.je.EnvironmentFailureException: (JE 4.1.6) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.6) K:\Internet-Archive\heritrix-3.1.1\bin\.\jobs\test\state fetchTarget of 0x0/0xbf parent IN=2 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x1/0x540 parent.getDirty()=false state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed. (in thread 'test launchthread')
21:21 | arkiver | and yeah
21:21 | arkiver | the problem is that a log file is missing
21:21 | arkiver | so I checked it
21:21 | arkiver | it is missing a 00000000.jdb file
21:21 | arkiver | now
21:22 | arkiver | I opened that folder, and right before I click to start the job again 00000000.jdb is still there
21:22 | arkiver | but right after I click, 00000000.jdb disappears and then I get the error
21:22 | arkiver | as if it is first deleting it and then trying to open it...
21:23 | arkiver | instead of first opening it and then deleting it
21:53 | godane | so it looks like this disc is doing better than the last one
21:54 | godane | not saying everything is ok yet
21:55 | godane | the video is still like last time
21:55 | godane | but the filesystem can be viewed
21:56 | godane | and the video does play
21:56 | godane | just fastforwarding is slower than normal
22:01 | yipdw | oh neat, DigitalOcean has a second datacenter in Amsterdam
22:18 | godane | so good news
22:18 | godane | turns out i mistyped my burning script
22:18 | ersi | http://fortvv2.capitex.se/beskrivning.aspx?guid=46PQ44OP65VBJM2B&typ=CMFastighet
22:19 | ersi | want
22:19 | ersi | so bad
22:19 | godane | it had --speed=4 instead of -speed=4
22:19 | ersi | (Old Swedish Military fortification/base with tunnels and everything)
23:17 | godane | SketchCow: at some point i will be uploading all the pdfs i got from ftp.qmags.com to my godaneinbox
23:17 | godane | that way we can make tons of collections for them
23:19 | godane | there are like magazines about clean rooms
23:19 | godane | for making computer chips
23:20 | godane | this doesn't take off the table anyone wanting to put a full ftp tar of it up
23:33 | godane | i copied an html file of the ftp root index and made a list of pdf files to grab
23:33 | godane | this way i don't have to download every exe and sea.hqx file