Time | Nickname | Message
00:10
๐
|
arrith1 |
RichardG: https://github.com/kngenie/ias3upload |
00:10
๐
|
RichardG |
thanks |
00:11
๐
|
arrith1 |
RichardG: err, seems people recommend https://github.com/kimmel/ias3upload |
00:14
๐
|
omf_ |
I maintain https://github.com/kimmel/ias3upload which has bug fixes |
00:23
๐
|
RichardG |
never really used IA before, you say I should send it to community media so it's a better mirror than dropbox? |
00:23
๐
|
RichardG |
or contact them to get a collection? |
00:27
๐
|
omf_ |
RichardG, how large in file size is all of Google Answers |
00:29
๐
|
RichardG |
313 MB in 7z ultra |
00:29
๐
|
RichardG |
but should send it uncompressed |
00:29
๐
|
RichardG |
which is like 3.4 GB |
00:29
๐
|
RichardG |
3.5 actually |
00:29
๐
|
omf_ |
yeah you don't need a collection for that. An item can hold 20-50gb |
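For reference, a single-file upload into an IA item through the S3-like API that ias3upload wraps looks roughly like the sketch below; the access/secret keys and file name are placeholders, and the metadata headers shown are only examples to adjust.

    # rough sketch of one file going into the google-answers-archive item (keys and file name are hypothetical)
    curl --location \
         --header "authorization: LOW ACCESSKEY:SECRETKEY" \
         --header "x-archive-auto-make-bucket:1" \
         --header "x-archive-meta-title:Google Answers archive" \
         --header "x-archive-meta-mediatype:web" \
         --upload-file google-answers.warc.gz \
         http://s3.us.archive.org/google-answers-archive/google-answers.warc.gz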
00:38
๐
|
RichardG |
I hope it's ok I saved without any site requirements, Google Answers is all text only anyways |
00:40
๐
|
RichardG |
compressing the cdxs for upload |
00:45
๐
|
RichardG |
feel like I'm doing something wrong if it falls in the Community Texts section.. :\ |
00:46
๐
|
arrith1 |
RichardG: you can ask for your own collection, or maybe there's a google collection |
00:48
๐
|
RichardG |
awesome, I'm stuck now! |
00:48
๐
|
omf_ |
Users only have access to upload to a few collections unless they are added to others |
00:49
๐
|
arrith1 |
RichardG: stuck waiting for a reply to mail the IA? |
00:49
๐
|
RichardG |
aaand my browser froze |
00:51
๐
|
arrith1 |
RichardG: if you need someone to run that crawler for the blogs, i can. if it's bogging down your computer |
00:51
๐
|
RichardG |
nah... I just need to know what to do |
00:52
๐
|
RichardG |
I have a collection in community media, with a few files uploaded until I panic-killed the browser |
00:52
๐
|
RichardG |
I wasn't ever into archiving before :P |
00:55
๐
|
RichardG |
I'm holding up on continuing to upload the Google Answers stuff until I come up with a solution... |
00:55
๐
|
RichardG |
can't find a collection, only spam |
00:55
๐
|
balrog |
whois RichardG |
00:55
๐
|
balrog |
whoops :P |
00:55
๐
|
balrog |
sorrty |
00:56
๐
|
balrog |
sorry** |
00:56
๐
|
balrog |
spam where? |
00:57
๐
|
RichardG |
nah, it might just be my inability to find a collection to fit the Google Answers stuff |
00:57
๐
|
balrog |
hmm... one of the ArchiveTeam collections? |
00:57
๐
|
RichardG |
basically: I have a few files accidentally in the opensource media section, and am looking for a proper home |
00:57
๐
|
RichardG |
could be, if I could move my original submission |
00:58
๐
|
balrog |
hm, I think SketchCow and underscor have access to move things on IA |
00:58
๐
|
RichardG |
tell them to move this |
00:58
๐
|
RichardG |
http://archive.org/details/google-answers-archive |
00:59
๐
|
RichardG |
if it goes through, I'll continue uploading tomorrow since it's late and 1.6 GB |
00:59
๐
|
balrog |
it probably belongs in this collection: http://archive.org/details/archiveteam |
00:59
๐
|
RichardG |
wish I could upload overnight but tight on power bill |
00:59
๐
|
RichardG |
yeah |
01:03
๐
|
RichardG |
well, I'll go to sleep now, if that could be moved I would appreciate, then I will continue uploading the archive, thanks for the help |
01:29
๐
|
arkhive |
It's going to be hard to save all of the Yahoo! Answers data when(not if) Yahoo! decides to shut it down. |
01:30
๐
|
arrith1 |
arkhive: preemptive action, especially crawls, for big sites is good |
02:19
๐
|
omf_ |
Anyone a system administrator at an ISP? |
02:21
๐
|
BlueMax |
why do you ask, omf_? |
02:25
๐
|
omf_ |
Cause it would be nice to have access to the zone files of well used DNS servers |
02:25
๐
|
arrith1 |
i might have something like that in a few months |
02:26
๐
|
arrith1 |
i'll make a note and get back to you |
02:26
๐
|
omf_ |
We always seem to be looking for domain names and subdomain names |
02:27
๐
|
BlueMax |
I may, keyword may, know someone who has access to that sort of thing, but this is no guarantee |
02:29
๐
|
BlueMax |
damn he's not online |
02:29
๐
|
omf_ |
And having someone at an ISP would give us the access level we need to simplify most of these problems. Of course I could set up a DNS server and do it myself but ISP's servers get heavy use so the data is already there |
02:58
๐
|
namespace |
omf_: Makes sense. |
03:00
๐
|
omf_ |
I used to work at an ISP so I know the data is there and easier to collect as an ISP |
03:00
๐
|
Aranje |
is this like query statistics or what |
03:02
๐
|
omf_ |
no. It is about discovering domain names without having to crawl sites looking for them |
03:03
๐
|
omf_ |
A system admin at a University would have the access needed as well |
03:03
๐
|
namespace |
Aranje: Are you around when Archive Team decides to grab a site? |
03:04
๐
|
namespace |
If you are then you'd know that the first thing they need to figure out is what to grab. |
03:04
๐
|
namespace |
Most big sites are a mess of domains and subdomains. |
03:04
๐
|
arrith1 |
crawling takes up a lot of valuable time |
03:04
๐
|
namespace |
Being able to decide to archive a site and have it Just Work (TM) would be a real help. |
03:07
๐
|
namespace |
http://en.wikipedia.org/wiki/Zone_file |
03:11
๐
|
Aranje |
so you're looking to be able to grab a zone file of the site to see if the subdomains are listed out? If yes, why would an ISP have access to that... doesn't only the site hosting have all of that info in one place? |
03:12
๐
|
Aranje |
this concept somewhat defies my (weak) knowledge of how dns is done |
03:13
๐
|
namespace |
... |
03:13
๐
|
namespace |
Aranje: Think about this for a minute. |
03:13
๐
|
namespace |
When you type in www.google.com into your browser, where does that request go to? |
03:14
๐
|
Aranje |
my computer |
03:14
๐
|
namespace |
... |
03:14
๐
|
namespace |
Oh dear. |
03:14
๐
|
Aranje |
because I have unbound running in caching mode |
03:14
๐
|
namespace |
Oh. |
03:15
๐
|
namespace |
Well after that. |
03:15
๐
|
Aranje |
unbound asks... I forget one of the japanese ISP's that has a resolver in san jose |
03:15
๐
|
Aranje |
who, if it doesn't have a copy of the record goes and recurses down from . to get it |
03:16
๐
|
namespace |
.? |
03:16
๐
|
Aranje |
the root |
03:16
๐
|
namespace |
Oh. |
03:16
๐
|
namespace |
Nevermind. |
03:16
๐
|
Aranje |
I'm not retarded, I just haven't eaten recently |
03:16
๐
|
Aranje |
lol |
03:16
๐
|
namespace |
I thought I had this in my head, but now I'm confused. |
03:17
๐
|
Aranje |
I also run the dns shit for our hosting company, which is why I was interested in the first place |
03:17
๐
|
Aranje |
likely I can't help, but if I can get a handle on exactly what is needed maybe I know someone who can |
03:19
๐
|
omf_ |
An ISP can request the zone file for all .org sites for example from companies like godaddy and verisign |
03:19
๐
|
Aranje |
for local caching, I see |
03:19
๐
|
omf_ |
that is not something normal people can do because it takes resources to generate that file |
03:20
๐
|
omf_ |
I have tried filling out the paperwork to make it happen |
03:20
๐
|
omf_ |
They always say the free service is for ISPs not normal people |
03:20
๐
|
* |
Aranje nods |
03:20
๐
|
* |
Aranje understands now |
03:20
๐
|
omf_ |
Let's say I got the TLD .org zone file, I would now have 3+ million domain names |
03:21
๐
|
omf_ |
based on previous published numbers |
03:22
๐
|
omf_ |
There is no way to get all domain names everywhere but it is possible to get blocks of them measured in the millions |
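Once you do have a TLD zone file, pulling the unique domain names out of it is a one-liner; this is only a sketch, since the column layout differs between registries' dump formats.

    # print unique registered names from NS records in a presentation-format zone dump
    # (field positions vary by registry, so adjust the $4 == "NS" test to the actual layout)
    awk '$4 == "NS" {print tolower($1)}' org.zone | sed 's/\.$//' | sort -u > org-domains.txt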
03:23
๐
|
arrith1 |
just have to put together 'omf ISP LLC' ;) |
03:24
๐
|
Aranje |
that'd be a fun project |
03:24
๐
|
omf_ |
We just need to ask around. There are 170 peeps in here and in that network of knowing people is probably a person who can help us. |
03:24
๐
|
Aranje |
sucking in a list of all domain names |
03:25
๐
|
omf_ |
Also someone who works at a big company and runs a DNS cache could find us sites as well |
03:25
๐
|
* |
Aranje nods |
03:26
๐
|
omf_ |
ISPs are best, universities usually get that access too, then large corps with many users would be next |
03:27
๐
|
omf_ |
There are places we could buy this information but fuck that |
03:27
๐
|
omf_ |
oooohh another idea. Someone who works at a hosting company |
03:27
๐
|
omf_ |
Amazon, Joyent, Digital Ocean, Linode, etc... they all provide multi-level DNS and caching |
03:28
๐
|
Aranje |
riverbed was the company I was thinking of |
03:28
๐
|
Aranje |
I have a friend there |
03:29
๐
|
omf_ |
Aranje, let me ask you about unbound |
03:29
๐
|
Aranje |
sure, I'll answer if I can |
03:29
๐
|
omf_ |
I was thinking having that running on a crawling server could really help speed up large scale grabs |
03:30
๐
|
omf_ |
Do you see speedups using it locally |
03:30
๐
|
Aranje |
the short answer is yes |
03:31
๐
|
Aranje |
I use it because I have charter as an ISP and I can't trust that dns requests will actually succeed, even to other providers |
03:31
๐
|
Aranje |
so I cache as much locally as possible |
03:31
๐
|
Aranje |
and there's no caching like local caching, for perf |
03:31
๐
|
omf_ |
who do you point to for peering? opendns? |
03:32
๐
|
Aranje |
lemme figure it out, it's a japanese isp |
03:32
๐
|
Aranje |
they have an IP in san jose with very very good numbers |
03:32
๐
|
omf_ |
for people following along I recommend reading the short info on passive DNS here - https://security.isc.org/ |
03:33
๐
|
Aranje |
ahh |
03:33
๐
|
Aranje |
it's ntt |
03:33
๐
|
Aranje |
129.250.35.250 is the IP I use |
03:34
๐
|
Aranje |
officially x.ns.gin.ntt.net |
03:34
๐
|
Aranje |
I'm halfway between LA and SF (in San Luis Obispo) and I've had great luck with stuff in SJC |
03:34
๐
|
omf_ |
Opera has done a lot of work on building domain lists |
03:35
๐
|
Aranje |
there's a dns performance test utility that I ran, and they got not the best speeds but the most stable |
03:36
๐
|
Aranje |
I used to run unbound in recursive caching mode, but I found switching to querying someone else gave me another drop in query time |
03:37
๐
|
Aranje |
If you have an archive box someplace, it might be fine to just find a dns server in the same datacenter. If that's not available, unbound is a great option. |
03:37
๐
|
namespace |
omf_: So what you're saying is that you just need a copy of the file? |
03:37
๐
|
omf_ |
Copies of TLD zones files are the best solution since they have "everything" |
03:38
๐
|
omf_ |
I just found another service that will offer it but only to researchers and shit |
03:38
๐
|
omf_ |
If someone from IA applied they could probably get access. underscor |
03:38
๐
|
Aranje |
you can tell unbound to refetch both keys and full records ahead of their expiry as well, greatly reducing having to wait on queries if they're near ttl expiry |
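This isn't Aranje's actual config (that gets shared later as a pastebin), just a minimal unbound.conf sketch of the options being discussed; the thread count, cache sizes, and the NTT forwarder address are assumptions to tune per box.

    server:
        interface: 127.0.0.1
        num-threads: 2            # matches the 2-core VM assumption
        msg-cache-size: 64m
        rrset-cache-size: 128m
        prefetch: yes             # refetch popular records before their TTL expires
        prefetch-key: yes         # same for DNSKEYs when validating
    forward-zone:
        name: "."
        forward-addr: 129.250.35.250   # x.ns.gin.ntt.net, the resolver Aranje mentions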
03:38
๐
|
namespace |
omf_: I was going to say, just ask anyway. |
03:38
๐
|
namespace |
Godaddy/etc have a cheap ass profit motive not to. |
03:39
๐
|
namespace |
The free for researchers guys might actually just give it to you if you ask. |
03:39
๐
|
omf_ |
This is who I was just referring to https://dnsdb.isc.org/ |
03:40
๐
|
namespace |
I could ask around on IRC if you want. |
03:40
๐
|
omf_ |
please |
03:43
๐
|
omf_ |
I already have a few million domain names collected. I plan to build this list up and put it on IA |
03:43
๐
|
omf_ |
Most of the URL lists we put up already for sites that closed |
03:44
๐
|
Aranje |
omf_:) if you continue being interested, I can give you my unbound config and the list of caveats for it :) |
03:44
๐
|
namespace |
Wait, a large should-be-public dataset that you plan to leak for the benefit of the public? |
03:44
๐
|
omf_ |
please Aranje I was looking to bundle unbound into a general crawler VM |
03:44
๐
|
omf_ |
yep |
03:44
๐
|
Aranje |
oh, perfect |
03:44
๐
|
namespace |
You can probably find somebody willing to help with that sort of marketing. |
03:45
๐
|
Aranje |
I'll remove some of the caveats and pass it over. How much ram will the VM have? |
03:45
๐
|
namespace |
Even set up an anonymous dump site. |
03:45
๐
|
omf_ |
ooh I should email Malamud |
03:45
๐
|
namespace |
dnsleaks.org |
03:45
๐
|
Aranje |
:3 |
03:45
๐
|
omf_ |
for testing purposes 1gb |
03:45
๐
|
omf_ |
but I test it on butt providers with 8gb ram |
03:47
๐
|
Aranje |
okay, I'll tune some of the numbers down a bit. I think the way I have it set up it can use up to 256mbish of ram |
03:47
๐
|
Aranje |
but I cache for the house |
03:47
๐
|
Aranje |
it also never approaches that |
03:47
๐
|
Aranje |
:D |
03:48
๐
|
namespace |
Does Archive Team have a blog? |
03:48
๐
|
omf_ |
malware domain lists are already public - http://www.malwaredomainlist.com/mdl.php |
03:48
๐
|
namespace |
This sort of thing is why it's a good idea to have one. |
03:48
๐
|
namespace |
Even with a thousand subscribers you'd probably have what you wanted in a day or two if you asked on it. |
03:48
๐
|
Aranje |
we have jason and well followed twitters |
03:49
๐
|
namespace |
Eh, I wouldn't want to bother Jason unless it was really necessary. |
03:50
๐
|
namespace |
#jengaforxanga |
03:50
๐
|
* |
Aranje makes assumptions about vm's and tunes accordingly |
03:50
๐
|
omf_ |
xanga and google reader would both benefit from this work |
03:52
๐
|
Aranje |
these vm's... are they debian or ubuntu? |
03:52
๐
|
omf_ |
neither |
03:52
๐
|
Aranje |
m |
03:53
๐
|
Aranje |
homebrew? or just a centos or something |
03:53
๐
|
omf_ |
nope |
03:53
๐
|
* |
Aranje wishes to tailor his config to the package likely to be installed |
03:53
๐
|
namespace |
omf_: Just say it so I don't have to scan my VM. |
03:54
๐
|
Aranje |
should I prepare a full zip with all the necessary files? |
03:54
๐
|
Aranje |
so it runs in its own directory and gives zero fucks? |
03:54
๐
|
* |
Aranje grins |
03:54
๐
|
omf_ |
Since most butt providers are dumb I go with the newest Linux they offer. This is usually Fedora but I prefer opensuse since it stays up to date and more stable than most of everything else |
03:54
๐
|
omf_ |
why would it matter |
03:54
๐
|
namespace |
butt providers? |
03:55
๐
|
Aranje |
the location of eg: root.hints and root.key changes |
03:55
๐
|
Aranje |
for dnssec validation |
03:55
๐
|
omf_ |
check this out namespace - https://github.com/panicsteve/cloud-to-butt it is a running joe |
03:55
๐
|
omf_ |
joke |
03:55
๐
|
omf_ |
I can handle that Aranje |
03:56
๐
|
namespace |
That's great. |
03:56
๐
|
omf_ |
we also call them clown hosting |
03:56
๐
|
namespace |
Because they're not even funny/ |
03:56
๐
|
namespace |
*? |
03:56
๐
|
omf_ |
because you are a clown for using one |
03:56
๐
|
namespace |
Ah. |
03:57
๐
|
namespace |
I feel sorry for clows. |
03:57
๐
|
namespace |
*clowns |
03:57
๐
|
omf_ |
because people think "cloud hosting" solves all their problems |
03:57
๐
|
namespace |
Pop culture destroyed that occupation. |
03:57
๐
|
omf_ |
I found 3 studies on 1+ million domains and no source data provided. |
03:58
๐
|
Aranje |
expected singlecore vm's? |
03:58
๐
|
omf_ |
multiple cores |
03:58
๐
|
Aranje |
2 seem like a sane default? |
03:59
๐
|
omf_ |
yes |
03:59
๐
|
Aranje |
(governs number of threads) |
04:15
๐
|
Aranje |
omf_:) do you want them looking up the addresses themselves or how I do it with basically being a caching proxy |
04:17
๐
|
omf_ |
caching |
04:18
๐
|
omf_ |
anything that makes speed the priority |
04:19
๐
|
omf_ |
just got 335,902 more domains |
04:19
๐
|
Aranje |
yep. I've run back over my config (found some problems with the one I wrote Ha!) and reread the whole man page while doing so. This one'll have you covered. |
04:19
๐
|
omf_ |
I am writing a crawler right now to collect domain lists from sale sites |
04:19
๐
|
omf_ |
thanks |
04:25
๐
|
omf_ |
Archive Team measures domain name collection in hundreds of thousands, anything less would be uncivilized :) |
04:28
๐
|
godane |
i'm making an update grab of thefeed on g4tv.com |
04:28
๐
|
omf_ |
I am going to put a crawler together for godaddy. I can get 500 domain names at a time |
04:29
๐
|
godane |
look what i have found: http://www.telnetbbsguide.com/dialbbs/dialbbs.htm |
04:30
๐
|
namespace |
godane: Most telnet BBS's are empty. |
04:30
๐
|
omf_ |
I think I just found a list of all active domains in 2011 |
04:30
๐
|
arrith1 |
omf_: if you target it, the Warrior could probably get it very fast |
04:32
๐
|
omf_ |
jackpot bitches |
04:33
๐
|
omf_ |
I just got 90 million unique domain names |
04:33
๐
|
omf_ |
90,000,000 <- that is a lot of zeros |
04:34
๐
|
omf_ |
and that is just the .com list |
04:35
๐
|
omf_ |
14 million .org |
04:40
๐
|
omf_ |
So now I have a nice big clean dataset to share |
04:45
๐
|
omf_ |
this is going to take a while to download |
05:00
๐
|
Sue|phone |
Omf_: aranje lost power |
05:02
๐
|
omf_ |
oh |
05:06
๐
|
omf_ |
also got .au, .ru, .net, .info, .us, .ca, .de and others |
05:21
๐
|
arrith1 |
omf_: gj |
05:26
๐
|
namespace |
Ugh, that moment when you've been grabbing for like a day and realize all your grabs are contentless. |
05:27
๐
|
arrith1 |
namespace: should really be a tool for that |
05:28
๐
|
arrith1 |
could use loose ml, something like 'these are good files, if later files differ significantly, notify me somehow' |
05:30
๐
|
namespace |
Yeah. |
05:30
๐
|
namespace |
We need to code one up. |
05:30
๐
|
namespace |
That was really frustrating. |
05:31
๐
|
namespace |
I'm just glad I checked before I'd grabbed everything. |
05:31
๐
|
namespace |
That would have been such a waste. |
05:31
๐
|
arrith1 |
maybe even have an option to be somewhat specific like "notify me if 2% differs, 5%, etc' |
05:31
๐
|
namespace |
Eh. |
05:31
๐
|
namespace |
You could make it even simpler. |
05:31
๐
|
arrith1 |
namespace: i'm actually wondering that now. Google Reader hasn't really ratelimited and is honest about http codes but it's too much to check manually in time |
05:31
๐
|
namespace |
If the files start coming out sub a certain amount of memory let me now. |
05:32
๐
|
namespace |
*know |
05:32
๐
|
arrith1 |
namespace: oh like smaller than some byte size/count |
05:32
๐
|
namespace |
Like 9KB files probably aren't right. |
05:32
๐
|
arrith1 |
"less than 4 KB, 2 KB" |
05:32
๐
|
arrith1 |
ah, yeah, less than 1 MB or 0.5 MB even sometimes |
05:33
๐
|
namespace |
Mine are less than a megabyte but probably more than ten kilobytes. |
05:33
๐
|
arrith1 |
that wouldn't even take ml, just a periodic thing and a notification system |
05:33
๐
|
namespace |
Yeah, that's what I'm saying. |
05:33
๐
|
namespace |
It could be a bash script. |
05:34
๐
|
arrith1 |
how to notify though is what i'm wondering. since people run things headless a lot |
05:34
๐
|
namespace |
Though python or some such would be better for the sake of not having to deal with bash. |
05:34
๐
|
namespace |
Email? |
05:34
๐
|
namespace |
System beep? |
05:34
๐
|
arrith1 |
bash gets the job done :P |
05:34
๐
|
arrith1 |
maybe writing out an error message to a file and "touch STOP" ing |
05:37
๐
|
arrith1 |
or killing the process, if the process doesn't support "touch STOP" |
05:38
๐
|
arrith1 |
could have an analysis thing where if you have a few known good files, or a directory of known good files, scans them and suggests rounded values to use |
05:39
๐
|
arrith1 |
should use inotify on linux |
05:54
๐
|
arrith1 |
alright on my TODO list to write that in python. but no one feel like they shouldn't write one if they want to |
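A minimal bash sketch of the "flag tiny grabs" idea being described; the 10 KB threshold, the directory argument, and the STOP-file convention are assumptions, not a finished tool.

    #!/bin/bash
    # warn (and drop a STOP file) if recently written files come out suspiciously small
    GRAB_DIR="${1:-.}"        # directory the grab is writing into (assumption)
    MIN_BYTES=10240           # ~10 KB threshold, per the discussion above
    while sleep 60; do
        small=$(find "$GRAB_DIR" -type f -mmin -2 -size -"$MIN_BYTES"c | head -n 20)
        if [ -n "$small" ]; then
            echo "suspiciously small grabs:" >&2
            echo "$small" >&2
            touch "$GRAB_DIR/STOP"   # only helps if the grab script watches for a STOP file
            break
        fi
    done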
08:14
๐
|
namespace |
Ugh. Page grabs work when I use wget, but a full grab gets me contentless crap. |
08:14
๐
|
namespace |
*use wget on a single page |
08:16
๐
|
arrith1 |
namespace: some kind of ratelimiting? |
08:16
๐
|
namespace |
arrith1: Maybe. |
08:16
๐
|
namespace |
It's a vbulletin forum. |
08:17
๐
|
namespace |
I'm waiting ten seconds between items. |
08:17
๐
|
arrith1 |
well test out the URLs in the browser, also could be sending diff stuff depending on user agent, could try the fx addon user agent switcher |
08:17
๐
|
namespace |
Yeah, I was gonna try to use a different user agent. |
08:17
๐
|
arrith1 |
i know lots of forums send different data to user agent claiming to be crawlers, like will require registration unless the ua is googlebot |
08:18
๐
|
namespace |
I am registered. |
08:18
๐
|
namespace |
And a long-time member, for that matter. |
08:18
๐
|
arrith1 |
namespace: is your wget supplying cookies? |
08:18
๐
|
namespace |
arrith1: Yup. |
08:19
๐
|
namespace |
And yes they're still valid. |
08:19
๐
|
arrith1 |
welll, could just be a poorly coded user-agent hack thing. i'd use your browser UA and/or try fx user agent addon switcher |
08:19
๐
|
arrith1 |
namespace: changed your browser ua? |
08:19
๐
|
namespace |
Hmm. |
08:19
๐
|
namespace |
I could try changing my browser UA to test. |
08:19
๐
|
ivan` |
what exactly is the contentless crap? |
08:20
๐
|
SilSte2 |
Hi @ all |
08:20
๐
|
SilSte2 |
Tunewiki is closing... |
08:20
๐
|
SilSte2 |
http://www.tunewiki.com/news/186/tunewiki-is-shutting-down |
08:22
๐
|
namespace |
ivan`: A message you get when you look at the sites index.html |
08:23
๐
|
SilSte |
and is there a problem with xanga? Stopped getting new items... |
08:24
๐
|
arrith1 |
SilSte: might be. a few users are reporting issues. xanga discussion in #jenga |
08:24
๐
|
arrith1 |
SilSte: could make a page on the ArchiveTeam wiki for tunewiki if you want |
08:24
๐
|
SilSte |
I'm unsure if it's important enough ^^ |
08:24
๐
|
arrith1 |
namespace: hm so sometimes you're able to get pages from wget, but not successively |
08:24
๐
|
SilSte |
And there are only 4 days left... |
08:25
๐
|
arrith1 |
SilSte: only thing that makes a site important enough is that people want to save it |
08:25
๐
|
arrith1 |
hmm |
08:25
๐
|
namespace |
^ This |
08:26
๐
|
namespace |
I think my options were: |
08:27
๐
|
namespace |
./wget --warc-file --no-parent --mirror -w 10 --limit-rate 56k --verbose --load-cookies |
08:27
๐
|
namespace |
(URL's redacted for privacy reasons.) |
08:30
๐
|
arrith1 |
namespace: -erobots=off, also if you don't specify a ua then it does wget. i'd put money on vbulletin shipping with something to handle wget UAs |
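Putting namespace's redacted command together with arrith1's suggestions, a filled-in version might look like the sketch below; the forum URL, cookie export, and browser user-agent string are placeholders, not the real (redacted) ones.

    # hypothetical example - substitute the real forum URL, your exported cookies, and your own browser UA
    wget --warc-file=forum-grab \
         --mirror --no-parent -w 10 \
         -e robots=off \
         --load-cookies cookies.txt \
         -U "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0" \
         http://forum.example.com/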
08:30
๐
|
namespace |
The robots.txt has nothing in it, basically. |
08:31
๐
|
namespace |
Except for a 2 minute crawl limit. |
08:31
๐
|
namespace |
Or something like that. |
08:31
๐
|
SilSte |
wikisecretword? |
08:31
๐
|
arrith1 |
namespace: could set your delay time to 2min and set your UA to a normal firefox/chrome UA |
08:31
๐
|
arrith1 |
SilSte: yahoosucks |
08:31
๐
|
arrith1 |
SilSte: you didn't say the line btw :P |
08:31
๐
|
SilSte |
thx |
08:31
๐
|
namespace |
arrith1: I'll try that. |
08:32
๐
|
namespace |
How do I get a firefox UA? |
08:32
๐
|
arrith1 |
google for "what is my UA" and pages will show you |
08:32
๐
|
namespace |
I know how to switch it of course. |
08:32
๐
|
namespace |
Ah, okay. |
08:32
๐
|
arrith1 |
so you can use your actual browser UA |
08:33
๐
|
arrith1 |
also like livehttpheaders fx plugin has that info, wireshark would also show it. might be some about:config thing that says it. also copying/pasting from firefox useragent switcher addon, but using your own is more stealthy if you visit a site a lot |
08:36
๐
|
namespace |
At 2 mins a page this should take a few months. :P |
08:37
๐
|
arrith1 |
namespace: hmm, well one upside is you probably would be totally safe from triggering any ratelimiting heh |
08:37
๐
|
namespace |
I'm more afraid that the connection would time out before I finish. |
08:38
๐
|
namespace |
Remember, WARC has no timestamping. |
08:38
๐
|
arrith1 |
namespace: might want to save wget logs then |
08:38
๐
|
SilSte |
namespace: arrith: what are you doing? |
08:38
๐
|
arrith1 |
namespace: btw you can do retries, if it times out then it'll retry |
08:39
๐
|
arrith1 |
SilSte: namespace is saving some forums he likes a lot, and i'm working on ways to get RSS/atom feed urls for the Google Reader effort. right now that means a crawler to get usernames from livejournal |
08:40
๐
|
SilSte |
kk |
08:40
๐
|
Coderjoe |
qw3rty+P3R50N4L |
08:40
๐
|
Coderjoe |
hoshit |
08:41
๐
|
SilSte |
I made a Wikisite |
08:41
๐
|
SilSte |
http://www.archiveteam.org/index.php?title=TuneWiki |
08:42
๐
|
SilSte |
I'm not familiar with building the warrior etc... just wanted to inform ^^ |
08:42
๐
|
namespace |
arrith1: Yeah, two minutes isn't happening. |
08:42
๐
|
arrith1 |
haha |
08:43
๐
|
namespace |
Sadly it's not doable on a sub 1200 baud modem. |
08:44
๐
|
SilSte |
I think the downloading of tunewiki should be kind of easy... there are lists with the artists ;-) |
08:44
๐
|
arrith1 |
namespace: well you could try without a limit, if all it cares about is the UA. if you aren't downloading it a bunch then one quick dl shouldn't be too noticeable |
08:44
๐
|
namespace |
This is my third attempt. |
08:44
๐
|
arrith1 |
SilSte: would you like to download tunewiki? with a single wget command you probably could grab it all |
08:44
๐
|
namespace |
I think it might be a little noticeable. |
08:44
๐
|
SilSte |
if you tell me how... i can try |
08:45
๐
|
arrith1 |
SilSte: do you have access to a terminal on a linux machine or VM? |
08:45
๐
|
SilSte |
i can install a ubuntu vm |
08:45
๐
|
namespace |
Okay. |
08:45
๐
|
namespace |
Here's the manual. |
08:45
๐
|
arrith1 |
SilSte: sure, any linux you're familiar with |
08:45
๐
|
namespace |
https://www.gnu.org/software/wget/manual/wget.html |
08:46
๐
|
namespace |
It's super boring, but it'll tell you everything you'd want to know. |
08:46
๐
|
namespace |
(Assuming you sort of know how a web server works.) |
08:46
๐
|
arrith1 |
namespace: if you do "--warc-file" and don't specify anything, does it make up its own name? |
08:46
๐
|
namespace |
arrith1: I don't think so. |
08:46
๐
|
namespace |
I never tried it. |
08:47
๐
|
namespace |
I just redacted the file name too, for the same reasons. |
08:47
๐
|
arrith1 |
i think that would be neat. just take whatever wget is calling the file and append "warc.gz" |
08:47
๐
|
arrith1 |
ah |
08:47
๐
|
arrith1 |
SilSte: make sure the virtual hard drive of the VM has enough space to save a big site |
08:48
๐
|
arrith1 |
SilSte: could do some 500 GB, since the VM won't take up all the space until it needs it |
08:48
๐
|
SilSte |
arrith1: is ubuntu good or would yo prefer debian? |
08:48
๐
|
namespace |
Seems to be working. |
08:48
๐
|
arrith1 |
SilSte: whichever you're more familiar with |
08:48
๐
|
arrith1 |
namespace: what delay? |
08:51
๐
|
namespace |
10 seconds |
08:52
๐
|
namespace |
I think the user agent change fixed it. |
08:52
๐
|
namespace |
I also turned off the rate limit. |
08:52
๐
|
arrith1 |
namespace: could try without the limit >:) |
08:52
๐
|
arrith1 |
heh |
08:52
๐
|
namespace |
Because bandwidth isn't the bottleneck. |
08:52
๐
|
arrith1 |
if the site doesn't care, and can spare the bw |
08:52
๐
|
arrith1 |
yeah |
08:52
๐
|
arrith1 |
that's always nice. google reader downloading has been like that |
08:53
๐
|
namespace |
The bottleneck is the wait time, which I have set to ten because it's not like the site is going anywhere. |
08:58
๐
|
arrith1 |
SilSte: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output |
08:58
๐
|
arrith1 |
SilSte: http://pad.archivingyoursh.it/p/wget-warc |
09:04
๐
|
arrith1 |
SilSte: need to compile wget 1.14 (apt-get install build-essential openssl libssl-dev; tar xvf wget.tgz; cd wget; ./configure --with-ssl=openssl; make) then use the binary <wget dir>/src/wget, and can use this as a template: wget --warc-file=tunewiki --no-parent --mirror -w 2 -U "Wget/1.14 gzip ArchiveTeam" https://www.tunewiki.com |
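Expanded into separate steps, that build-and-run recipe looks roughly like the following; the tarball URL is the usual GNU ftp path (an assumption), and the user agent drops the "gzip" token per ivan`'s later note.

    # build wget 1.14 with SSL support (package names per arrith1's note)
    sudo apt-get install build-essential openssl libssl-dev
    wget http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xf wget-1.14.tar.gz
    cd wget-1.14
    ./configure --with-ssl=openssl
    make
    # then run the freshly built binary from src/
    ./src/wget --warc-file=tunewiki --no-parent --mirror -w 2 --tries=10 \
               -U "Wget/1.14 ArchiveTeam" https://www.tunewiki.com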
09:04
๐
|
SilSte |
arrith its still installing ;-) |
09:05
๐
|
SilSte |
okay |
09:05
๐
|
SilSte |
thx |
09:05
๐
|
arrith1 |
SilSte: alright. hardest part is compiling wget probably. you might get slowed/down blocked depending on how the site responds. if the site seems to be not limiting you, you can take off the "-w 2" |
09:06
๐
|
SilSte |
ok |
09:06
๐
|
SilSte |
how will I know that they are limiting? |
09:07
๐
|
arrith1 |
SilSte: if you have terminal questions you can ask in #ubuntu on the Freenode irc network |
09:07
๐
|
arrith1 |
SilSte: the command will probably error out. oh yeah, you might want to put in a retries thing: --tries=10 |
09:09
๐
|
arrith1 |
SilSte: generally trial and error, you can ask in here or in #ubuntu or #debian on Freenode. one thing to keep in mind is if a binary isn't in your PATH then you need to specify the full path to it, so /home/user/wget_build/wget-1.14/src/wget --mirror <etc> |
09:09
๐
|
arrith1 |
SilSte: another channel to be in on this network is #archiveteam-bs |
09:09
๐
|
SilSte |
ok |
09:11
๐
|
Smiley |
that's where all the crap talk is. |
09:13
๐
|
namespace |
Yeah, it's working now. |
09:25
๐
|
SilSte |
arrith1: Wget is already installed in 1.14. Is it ok then? |
09:25
๐
|
namespace |
Yeah. |
09:25
๐
|
Smiley |
yah should be. |
09:26
๐
|
namespace |
If it supports --warc-file it's good. |
09:26
๐
|
SilSte |
ok |
09:26
๐
|
arrith1 |
SilSte: do "man wget" and look for warc |
09:26
๐
|
namespace |
That's the new option they added last year. |
09:26
๐
|
arrith1 |
SilSte: or wget --help |
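A quick way to check, along these lines:

    wget --version | head -n 1      # should report 1.14 or newer
    wget --help | grep -i warc      # lists the --warc-* options if support is compiled in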
09:27
๐
|
SilSte |
looks good |
09:29
๐
|
arrith1 |
SilSte: good to hear!, that's the benefit of installing a newer version of a distro i guess |
09:29
๐
|
SilSte |
installed 13.04 server |
09:29
๐
|
arrith1 |
SilSte: you should be good to go, just make a dir, cd in, and try some wget stuff. that one i said should work as-is, but you can add what you want |
09:29
๐
|
SilSte |
but it looks a little bit slow... |
09:29
๐
|
arrith1 |
SilSte: the download or the vm? |
09:30
๐
|
SilSte |
download |
09:30
๐
|
arrith1 |
SilSte: btw i hope your hdd is huge |
09:30
๐
|
SilSte |
it makes about one step per sec |
09:30
๐
|
arrith1 |
SilSte: you can remove the "-w 2" so it doesn't wait |
09:30
๐
|
SilSte |
i made a 600GB file |
09:30
๐
|
arrith1 |
good :) |
09:31
๐
|
SilSte |
will this be fast enough with that around 1 item per sec? |
09:31
๐
|
ivan` |
"Wget/1.14 gzip" is a special string for Google, wget does not actually support gzip |
09:31
๐
|
SilSte |
it made about 2MB right now... |
09:32
๐
|
arrith1 |
ivan`: ah, i meant to have the "ArchiveTeam" in there and copypasted heh |
09:32
๐
|
SilSte |
and is it possible to pause and restart wget? Or will it start over? |
09:32
๐
|
arrith1 |
SilSte: that's a thing that limits how fast wget goes |
09:34
๐
|
SilSte |
okay ... so i can try without... if I need to stop. Does it restart? |
09:35
๐
|
arrith1 |
SilSte: i don't know for sure, but i think it would be fine. to stop and resume, the stuff --mirror turns on should be fairly comprehensive |
09:35
๐
|
arrith1 |
SilSte: probably best to leave it on, and let it go as fast as the site will let it |
09:35
๐
|
arrith1 |
SilSte: ctrl-c to stop |
09:36
๐
|
SilSte |
arrith1: "probably best to leave it on, and let it go as fast as the site will let it" with or without -w 2? Atm its on |
09:36
๐
|
SilSte |
I'm getting some 404s... is that okay? |
09:38
๐
|
arrith1 |
SilSte: yeah. could try without the -w |
09:41
๐
|
SilSte |
now its running ^^ |
09:42
๐
|
SilSte |
as long as its returning 200s everything should be fine I think |
09:42
๐
|
arrith1 |
SilSte: yep |
09:42
๐
|
SilSte |
if they blacklist me... it will just stop? |
09:42
๐
|
arrith1 |
SilSte: might want to keep the directory you use fairly clean, so for each attempt could make a new directory |
09:43
๐
|
arrith1 |
SilSte: that's one way to blacklist. another is to show data that isn't the real data from the site (garbage data), or slowing it down |
09:43
๐
|
arrith1 |
SilSte: periodically it would be good to check the data you're getting |
09:43
๐
|
SilSte |
how can i do this? |
09:43
๐
|
SilSte |
ahh |
09:43
๐
|
SilSte |
found the folder |
09:43
๐
|
arrith1 |
SilSte: you can setup a shared folder between your host and guest and copy files from the guest to the host to view them |
09:44
๐
|
arrith1 |
SilSte: yeah, something like ~/archiveteam/tunewiki/attempt1 then attempt2, attempt3, etc |
09:44
๐
|
SilSte |
why different attempts? ^^ |
09:45
๐
|
arrith1 |
SilSte: sometimes you want to start over or start fresh |
09:45
๐
|
SilSte |
hmmm k |
09:45
๐
|
arrith1 |
maybe if one attempt was going wrong for some reason |
09:45
๐
|
SilSte |
i will run it now and watch back later ;-) |
09:46
๐
|
arrith1 |
good luck :) |
09:48
๐
|
SilSte |
I'm getting some "is not a directory"- errors |
09:49
๐
|
arrith1 |
after GR is down, spidering feedburner would be good. means i need to find solid crawling software. gnu parallel's example page has an interesting section on wget as a parallel crawler |
09:49
๐
|
arrith1 |
SilSte: hm odd, google the errors you're curious about |
09:49
๐
|
SilSte |
kk |
09:49
๐
|
SilSte |
thx for your help <3 |
11:06
๐
|
Nemo_bis |
Why not disallow all log/ dir now that it exists? https://catalogd.us.archive.org/robots.txt |
11:23
๐
|
ivan` |
SketchCow: thanks for that retweet |
11:29
๐
|
Nemo_bis |
I'm not sure if these are complete :S https://ia601803.us.archive.org/zipview.php?zip=/12/items/ftp-ftp.hp.com_ftp1/graham.zip https://ia601803.us.archive.org/zipview.php?zip=/12/items/ftp-ftp.hp.com_ftp1/catia.zip |
12:46
๐
|
Smiley |
Ok, anyone alive to assist with me trying to setup this EC2 stuff? |
12:46
๐
|
Smiley |
I can't even ssh in even though the rules seem ok D: |
12:47
๐
|
Smiley |
Oh ffs, fixed that. |
12:58
๐
|
Smiley |
Ok - next person who can help me getting this EC2 instance up and running please let me know. I have ubuntu server 13. something, wget 1.14, tornado, the seesaw stuff, all done. |
13:45
๐
|
SilSteStr |
test |
13:45
๐
|
SilSteStr |
okay |
13:50
๐
|
SilSteStr |
gnah... wget does not continue the warc file -.- |
13:51
๐
|
ivan` |
you can make a new warc and cat it onto the old one |
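Concretely, since a .warc.gz is just concatenated gzip members, the append ivan` describes is a plain cat; the file names here are examples.

    # after a second run written to tunewiki-part2.warc.gz (name is an example):
    cat tunewiki-part2.warc.gz >> tunewiki.warc.gz
    # the result stays a valid .warc.gz that most warc tools can read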
13:52
๐
|
SketchCow |
No problem. |
13:52
๐
|
SketchCow |
I'm packing up soon to drive south |
13:53
๐
|
PepsiMax |
Any sources about xanga.com dying? |
13:53
๐
|
PepsiMax |
I don't mind doing some CPU cycles/traffic for ArchiveTeam, but is it worth it? :-) |
13:53
๐
|
PepsiMax |
- if they dont dy :P |
13:53
๐
|
PepsiMax |
die |
13:54
๐
|
SilSteStr |
ivan`: how? |
13:55
๐
|
SilSteStr |
atm my command is "wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com |
13:56
๐
|
Smiley |
PepsiMax: either you do it, or you don't. We aren't here to convince you, but it's going. |
13:56
๐
|
Smiley |
If you wanted to find it, you simply need to visit the damn site itself. |
13:57
๐
|
SketchCow |
So, XANGA.COM's Xanga Team log posted this entry a while ago: |
13:58
๐
|
SketchCow |
http://thexangateam.xanga.com/773587240/relaunching-xanga-a-fundraiser/ |
13:58
๐
|
SketchCow |
* May 30th: We launch this fundraiser, and continue our work building a WordPress version of Xanga. |
13:59
๐
|
SketchCow |
* Through July 15th: We will contact our registered members to let them know about the fundraiser, and also allow any and all users to download their blogs and media files for free. |
13:59
๐
|
SketchCow |
* July 15th: This will be the final day for the fundraiser. |
13:59
๐
|
SketchCow |
If we have a successful fundraiser: |
13:59
๐
|
SketchCow |
* July 15th: If we've raise $60k, then we will move over to the new WordPress version on this date. |
13:59
๐
|
SketchCow |
If the fundraiser isn't successful: |
13:59
๐
|
SketchCow |
* July 15th: If we haven't raised $60k, then this will be the last date that Xanga is up and running. |
13:59
๐
|
SketchCow |
... |
13:59
๐
|
SketchCow |
So, that means that either 1. They're going to delete everything, or 2. They're going to utterly move everything to a new platform, which leads to lost formatting, items, and who knows what else. |
14:05
๐
|
SketchCow |
http://thexangateam.xanga.com/774035061/update-on-relaunch-xanga-fundraiser-and-xanga-archives-news/ is a follow up post. They indicate how to download the blogs if you want to, but again they are not clear of how the blogs would change. |
14:23
๐
|
winr4r |
yeah, and even if it's 2), they're going to a paid-account model, which means that all the old free users would probably lose all of their shit |
14:23
๐
|
winr4r |
well, anyone that didn't sign up for the new paid service would lose their shit |
14:24
๐
|
winr4r |
they're a bit unclear on that count, but that's what i've extrapolated from what they have *not* said |
14:28
๐
|
SketchCow |
Right. This is all a mountain of uncertainty, making a backup worth doing. |
14:28
๐
|
SketchCow |
And now PepsiMax got the learns |
14:29
๐
|
SilSteStr |
if I need to stop my wget... what shall i do to prevent a total restart? |
14:30
๐
|
winr4r |
is --no-clobber what you are looking for? |
14:30
๐
|
Smiley |
eheeerrrghh what? |
14:30
๐
|
Smiley |
ec2-bundle-vol has created loaaaaaaads of files |
14:31
๐
|
Smiley |
image.part.{00..57} |
14:31
๐
|
Smiley |
and a 10Gb image file too |
14:32
๐
|
SilSteStr |
winr4r: don't know... but wget stopped downloading... I stopped it, started it again and now the WARC file begins from the beginning.... |
14:32
๐
|
PepsiMax |
SketchCow: http://pphilip.xanga.com/774075894/your-blog-is-not-useless/ |
14:32
๐
|
PepsiMax |
sounds a bit too good to be true :P |
14:36
๐
|
Smiley |
errrr |
14:36
๐
|
Smiley |
what are you trying to prove PepsiMax ? |
14:38
๐
|
Smiley |
"The unfortunate thing about the xanga archives is that the html is hardcoded to link to images on the xanga servers - which will no longer be there. So you will have the text of your blogs - and comments - but you will not easily be able to find what pictures go with each blog entry after the xanga servers go down." |
14:38
๐
|
Smiley |
Fail. |
14:41
๐
|
winr4r |
SilSteStr: the --continue/-c option might do what you want, don't know how that plays with WARC though |
14:44
๐
|
SilSteStr |
winr4r: its not working with WARC... |
14:44
๐
|
winr4r |
:< |
14:45
๐
|
omf_ |
continue does not work with warc |
14:46
๐
|
SilSteStr |
so do I really have to start over each time? Oo |
14:47
๐
|
Smiley |
hmmmm kind of |
14:47
๐
|
Smiley |
you should log the finished urls and then exclude them. |
14:47
๐
|
Smiley |
:D |
14:48
๐
|
SilSteStr |
Oo |
14:48
๐
|
SilSteStr |
I'm not really familiar... |
14:50
๐
|
Smiley |
me neither, but it'd work I think :D |
14:51
๐
|
* |
omf_ poke ivan` |
14:51
๐
|
SilSteStr |
lol |
14:52
๐
|
omf_ |
the problem with Smiley's idea is that wget limits how many urls you can skip because it is junk |
14:52
๐
|
Smiley |
Doh! |
14:56
๐
|
Aranje |
omf_:) sorry, power outage. dunno if sue said anything, I asked him to. |
14:56
๐
|
omf_ |
I got the message A |
14:57
๐
|
Aranje |
kk. <3 sue |
14:57
๐
|
omf_ |
no worries |
14:57
๐
|
omf_ |
I am currently sucking down 150 million domain names |
14:58
๐
|
ivan` |
sup |
14:59
๐
|
omf_ |
I got 150 million unique domain names ivan |
14:59
๐
|
ivan` |
great |
14:59
๐
|
ivan` |
can I grab them yet? |
14:59
๐
|
omf_ |
I am still downloading the lists |
14:59
๐
|
ivan` |
alright |
15:00
๐
|
omf_ |
They are broken into blocks of 5000 for easy management |
15:00
๐
|
omf_ |
I also got all the urls from dmoz and 350,000 from a domain sale site |
15:02
๐
|
ivan` |
for ameblo.jp,blog.livedoor.jp,feeds.feedburner.com,feeds2.feedburner.com,feeds.rapidfeeds.com,blog.roodo.com it would be super-great if you could get the thing after the first slash |
15:02
๐
|
ivan` |
groups.yahoo.com/group/,groups.google.com/group/,www.wretch.cc/blog/ second slash |
15:03
๐
|
ivan` |
youtube.com/user/ second slash but I don't know if I'll get to those, seems kind of low value anyway |
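Assuming a plain text file of URLs, the "thing after the first slash" extraction can be sketched with grep and awk; the input file name is an example and the host alternations only cover part of ivan`'s list.

    # first path segment for hosts like feeds.feedburner.com (urls.txt is an example input)
    grep -E '^https?://(feeds2?\.feedburner\.com|ameblo\.jp|blog\.livedoor\.jp)/' urls.txt \
        | awk -F/ '{print $3 "/" $4}' | sort -u
    # second segment for groups.google.com/group/<name> and groups.yahoo.com/group/<name> style URLs
    grep -E '^https?://groups\.(google|yahoo)\.com/group/' urls.txt \
        | awk -F/ '{print $3 "/" $4 "/" $5}' | sort -u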
15:03
๐
|
SilSteStr |
omf_: what are you doing? |
15:03
๐
|
omf_ |
collecting domain names which I plan to release as a data set on IA |
15:03
๐
|
SilSteStr |
omf_: So what shall I use instead? |
15:04
๐
|
ivan` |
thanks omf_ |
15:04
๐
|
SilSteStr |
omf_: kk |
15:04
๐
|
omf_ |
The normal big lists are only 1 million domains total and there are only 2 of those lists public |
15:05
๐
|
omf_ |
basically someone could seriously start a search engine using this list |
15:05
๐
|
Aranje |
omf_:) http://pastebin.com/5y0aemPs |
15:06
๐
|
Aranje |
primary assumption: installed on each node, not centrally |
15:07
๐
|
omf_ |
that is correct Aranje |
15:07
๐
|
Aranje |
wonderful :) |
15:07
๐
|
* |
Aranje fixes local config based on changes |
15:07
๐
|
* |
Aranje grins |
15:08
๐
|
Smiley |
http://www.governmentattic.org/8docs/NSA-WasntAllMagic_2002.pdf |
15:08
๐
|
Smiley |
http://www.governmentattic.org/8docs/NSA-TrafficAnalysisMonograph_1993.pdf |
15:08
๐
|
Smiley |
someone go get em and submit to IA plz |
15:09
๐
|
* |
Smiley won't as he's going to get his train now. |
15:09
๐
|
omf_ |
got em |
15:10
๐
|
omf_ |
site is down again |
15:10
๐
|
omf_ |
too much HN/reddit traffic |
15:13
๐
|
ivan` |
my 22GB/2.4 billion commoncrawl set http://204.12.192.194:32047/common_crawl_index_urls.bz2 will be up for another week, I do not really know how/where to upload to IA |
15:13
๐
|
omf_ |
I can take care of that for you ivan` if you want. |
15:14
๐
|
GLaDOS |
I |
15:14
๐
|
GLaDOS |
'll start a fetch for it onto anarchive |
15:14
๐
|
omf_ |
good idea GLaDOS |
15:15
๐
|
omf_ |
ivan`, how should I get this csv of domains to you? |
15:16
๐
|
ivan` |
omf_: a torrent would be most convenient but just about anything will work |
15:16
๐
|
ivan` |
how big is it? |
15:16
๐
|
omf_ |
this is just the 335k list |
15:17
๐
|
omf_ |
12mb uncompressed |
15:17
๐
|
ivan` |
if it's <1GB http://allyourfeed.ludios.org:8080/ |
15:17
๐
|
ivan` |
heh |
15:18
๐
|
omf_ |
done |
15:19
๐
|
ivan` |
got it, thanks |
15:20
๐
|
Aranje |
I like this config better than the one I was using >_> |
15:33
๐
|
SilSteStr |
ivan`: are there things someone can help with on Google? |
15:39
๐
|
ivan` |
SilSteStr: yes, we really need good query lists that will find more feeds using Reader's Feed Directory |
15:39
๐
|
ivan` |
n-grams, obscure topics, words in every language, etc |
15:40
๐
|
ivan` |
some of the sites listed on http://www.archiveteam.org/index.php?title=Google_Reader need to be spidered to find more users |
15:40
๐
|
ivan` |
I can put up a list of every query that's been imported into greader-directory-grab for inspiration |
15:41
๐
|
SilSteStr |
so i should run the "greader directory grab"? |
15:41
๐
|
ivan` |
sure |
15:42
๐
|
ivan` |
it does not do the tedious work of finding things to search for, however ;) |
15:44
๐
|
SilSteStr |
how may i help there? |
15:46
๐
|
ivan` |
you can google for big lists of things, see also wikipedia's many lists, and make clean lists of queries |
15:46
๐
|
ivan` |
the queries get plugged into https://www.google.com/reader/view/#directory-page/1 - you can see if you get good results |
15:47
๐
|
SilSteStr |
./o\ I need a Google acc then :D |
15:47
๐
|
ivan` |
we also need 2-grams for all the languages, that is, word pairs |
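One cheap way to turn a word list into word-pair queries, as a sketch (the input and output file names are examples):

    # random 2-gram queries from a one-word-per-line list (words.txt is an example)
    paste -d' ' <(shuf words.txt) <(shuf words.txt) | head -n 10000 > queries.txt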
15:47
๐
|
ivan` |
indeed |
15:47
๐
|
SilSteStr |
wanted me to log in ^^ |
15:47
๐
|
SilSteStr |
lets continue in the other channel ^^ |
15:47
๐
|
ivan` |
yep |
16:46
๐
|
SilSteStr |
I'm still getting those "is not a directory" failures with tunewiki :(. It's also saying: "Cannot write to XY" (Success) ... |
16:47
๐
|
SilSteStr |
http://snag.gy/EHokz.jpg (german sry) |
16:47
๐
|
SilSteStr |
any ideas? |
16:47
๐
|
SilSteStr |
wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com |
16:47
๐
|
SilSteStr |
is the command i used |
20:00
๐
|
namespace |
SilSteStr: Why'd you turn off robots.txt? |
20:00
๐
|
namespace |
Is their robots.txt stupid or? |
20:00
๐
|
SilSteStr |
read this somewhere :D |
20:01
๐
|
namespace |
It could also be the User Agent. |
20:02
๐
|
namespace |
Oh I think I get it. |
20:02
๐
|
namespace |
You need to turn off --no-parent |
20:02
๐
|
namespace |
Tunewiki here would be the root directory. |
20:02
๐
|
namespace |
So there's no point in having it on and it might be messing it up. |
20:02
๐
|
SilSteStr |
hmmm |
20:03
๐
|
SilSteStr |
so... should i delete everything? |
20:03
๐
|
namespace |
Wait, is it a wikia forum? |
20:03
๐
|
namespace |
No, it's not. |
20:03
๐
|
namespace |
Does "everything" have no data? |
20:04
๐
|
namespace |
You can check with a browser. |
20:05
๐
|
SilSteStr |
no |
20:05
๐
|
SilSteStr |
only some.... |
20:05
๐
|
namespace |
I wouldn't delete it if it's got data. |
20:05
๐
|
SilSteStr |
and i'm not sure if there is no data... |
20:05
๐
|
namespace |
I just said check with a browser. |
20:05
๐
|
SilSteStr |
it just says at some points that there is an "is not a directory" failure... |
20:06
๐
|
namespace |
... |
20:06
๐
|
namespace |
But it keeps grabbing? |
20:06
๐
|
SilSteStr |
yes |
20:06
๐
|
SilSteStr |
i googled this... |
20:06
๐
|
SilSteStr |
one sec |
20:07
๐
|
SilSteStr |
i found this |
20:07
๐
|
SilSteStr |
http://superuser.com/questions/266112/mirroring-a-wordpress-site-with-wget |
20:07
๐
|
SilSteStr |
but I don't know how to fix it... |
20:07
๐
|
Tephra |
SilSteStr: I had that problem when there was a file name foo and then a directory named foo and wget tried to download to foo/bar, it couldn't create the directory foo since the file foo existed |
20:07
๐
|
Tephra |
if that makes sense |
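One workaround sometimes suggested for that file/directory collision (untested here) is wget's --adjust-extension / -E, which saves HTML pages with a .html suffix so a directory of the same name can still be created later:

    # -E saves the page as .../rihanna.html instead of .../rihanna,
    # leaving the bare name free for the rihanna/ directory (may not cover every case)
    wget --warc-file=tunewiki --mirror --no-parent -e robots=off -E https://www.tunewiki.com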
20:08
๐
|
SilSteStr |
Tephra: I think this is the problem... |
20:08
๐
|
namespace |
Man, Archive Team's combined knowledge could be used for some serious patches to wget. |
20:09
๐
|
Tephra |
SilSteStr: can you try with --no-clobber ? |
20:09
๐
|
namespace |
Not being able to resolve a file name duplication is fail. |
20:09
๐
|
SilSteStr |
ok... I will try it in another folder... |
20:10
๐
|
SilSteStr |
I'm already at h... ;-) |
20:10
๐
|
Tephra |
namespace: yes and a serious pain in the ass, when you have been grabbing something for 1 h then start seeing that message |
20:11
๐
|
SilSteStr |
I'm also not really sure if everything is fine... after about 6 hours the warc file has only 150MB... |
20:13
๐
|
SilSteStr |
I get the failure |
20:13
๐
|
SilSteStr |
"Timestamping" and "overwriting old files" at the same time is not possible |
20:14
๐
|
SilSteStr |
(in German ^^, -> translated) |
20:14
๐
|
Tephra |
do you mean possible? |
20:17
๐
|
Tephra |
SilSteStr: try --force-directories maybe? |
20:17
๐
|
SilSteStr |
uhh |
20:17
๐
|
SilSteStr |
yes |
20:17
๐
|
SilSteStr |
okay |
20:17
๐
|
SilSteStr |
its running now |
20:18
๐
|
SilSteStr |
same problem again |
20:18
๐
|
Tephra |
seems like there's a bug filed: http://savannah.gnu.org/bugs/?29647 |
20:19
๐
|
SilSteStr |
in 2010 Oo |
20:19
๐
|
SilSteStr |
"www.tunewiki.com/lyrics/rihanna: Is not a directory www.tunewiki.com/lyrics/rihanna/diamons: Is not a directory" |
20:20
๐
|
SilSteStr |
"Cannot write to "www.tunewiki.com/lyrics/rihanna(diamons" (success). |
20:20
๐
|
Tephra |
wget moves slow |
20:21
๐
|
Tephra |
seems like a patch was made in 2012 |
20:22
๐
|
SilSteStr |
but will it work with 1.14? ^^ |
20:23
๐
|
Tephra |
dunno |
20:25
๐
|
SilSteStr |
I'll make a first run... afterwards there is hopefully time for tweaking... |
20:29
๐
|
Tephra |
maybe we should file a bug report and hopefully it gets fixed |
20:31
๐
|
Tephra |
SilSteStr: could you send me the complete command and url that you are trying? |
20:32
๐
|
SilSteStr |
wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com |
20:32
๐
|
Tephra |
thanks! |
20:33
๐
|
RichardG |
I'm uploading Google Answers while waiting for IA to move it to ArchiveTeam - right now doing batch 390 of 787 |
20:33
๐
|
ivan` |
wget does not support gzip don't put that in the user agent |
20:35
๐
|
SilSteStr |
kk |
20:36
๐
|
SilSteStr |
but doesnt change anything... |
20:37
๐
|
ivan` |
I know, it's just a moral hazard to leave it in there |
20:37
๐
|
SilSteStr |
kk ^^ |
20:38
๐
|
ivan` |
one day I'll want some gzip data and that sucky user agent has spread all over the internet |
20:38
๐
|
SilSteStr |
if I sent a wget to background... is there a possibility to get it to the foreground? |
20:38
๐
|
ivan` |
SilSteStr: you can start it in screen, detach, attach |
20:38
๐
|
ivan` |
or tmux if you like that |
20:39
๐
|
SilSteStr |
and if its already started? ^^ |
20:39
๐
|
SilSteStr |
chose to log to a file |
20:39
๐
|
ivan` |
fg, maybe |
20:40
๐
|
SilSteStr |
not working... |
20:40
๐
|
SilSteStr |
looks like russian roulette then ;-) |
20:43
๐
|
Tephra |
hmm can't get it to work, looks like a genuine bug to me |
20:44
๐
|
SilSteStr |
kk |
21:27
๐
|
arrith1 |
SilSteStr: should google: how to cat files |
21:27
๐
|
arrith1 |
SilSteStr: that's how you cat warcs together |
21:28
๐
|
SilSteStr |
this works with warcs? |
21:29
๐
|
SilSteStr |
what about double files? |
21:29
๐
|
arrith1 |
SilSteStr: like wget when a thing is a directory and file with the same name? |
21:31
๐
|
SilSteStr |
? |
21:40
๐
|
arkhive |
How do I know if the Wayback Machine grabbed all of a site? www.xbdev.net/index.php |
21:40
๐
|
arkhive |
Is there a way to compare it automatically? |
21:41
๐
|
arrith1 |
SilSteStr: what do you mean by "double files"? |
21:42
๐
|
arrith1 |
arkhive: "if you have a video list in a file you can use ia-dirdiff to check the items on IA" |
21:42
๐
|
arrith1 |
arrith1: that was said earlier in #archiveteam-bs, so ia-dirdiff might do what you want |
21:43
๐
|
arrith1 |
arrith1: one way to know for sure is to wget it yourself then upload it to IA with warcs ;) |
21:59
๐
|
arrith1 |
er |
21:59
๐
|
arrith1 |
arkhive: |
21:59
๐
|
arrith1 |
no idea what's going on |
22:01
๐
|
arkhive |
arrith1: I can't grab it myself atm |
22:02
๐
|
arrith1 |
acknowledged |
22:03
๐
|
omf_ |
You could also check the CDX search for urls |
22:10
๐
|
SilSteStr |
did you think about a raspberry pi warrior? |
22:16
๐
|
omf_ |
I have a raspberry pi warrior |
22:16
๐
|
omf_ |
I made it a few months ago |
22:17
๐
|
SilSteStr |
does it work good? |
22:18
๐
|
SilSteStr |
I would like something autonomous to spread to family ;-) |
22:18
๐
|
SilSteStr |
already did this with tor |
22:18
๐
|
SilSteStr |
small little boxes, safe config |
22:19
๐
|
SilSteStr |
should work for warriors as well ;-) |
22:20
๐
|
arrith1 |
probably would be fairly CPU-bound, but possibly optimizable. at least Raspbian is sort of the default, shouldn't be too hard to attempt porting of the warrior stuff from the vm |
22:20
๐
|
omf_ |
it is not cpu bound it is RAM limited |
22:20
๐
|
omf_ |
wget is a filthy pig |
22:21
๐
|
arrith1 |
omf_: max ram i've heard of on the raspi is 512 MB, what sort of usage does the warrior's wget get to? |
22:21
๐
|
omf_ |
My plans are to do a few week test run doing url shorteners to see how it works out |
22:21
๐
|
omf_ |
wget uses more ram the more urls it collects |
22:22
๐
|
omf_ |
so the bigger the site, the more ram |
22:23
๐
|
arrith1 |
well, could try to do some heuristics to keep the urls the raspi warrior loads up at once within RAM limits |
22:24
๐
|
omf_ |
which would require changing the mess of shit code known as wget |
22:24
๐
|
omf_ |
I am testing out a warc convertor for httrack since it already is smart about managing ram and concurrent connections |
22:25
๐
|
arrith1 |
hmm maybe. i think wget-lua is fairly powerful though. |
22:25
๐
|
omf_ |
In terms of web scrapers wget is the drooling retard in the corner |
22:25
๐
|
xmc |
heh, definitely |
22:25
๐
|
omf_ |
wget-lua is a hack to work around wget's design flaws |
22:26
๐
|
omf_ |
and it uses more ram since it has to run lua |
22:29
๐
|
omf_ |
we use wget because it has warc support built in. I am working on warc support for better applications |
22:29
๐
|
dashcloud |
Nemo_bis: HP carries the service manuals for all their products on the product page- if you can easily extract those, it would be a great addition to the service manuals collection |
22:29
๐
|
omf_ |
just like someone took the time to build warc into wget. Evolution based on available developer time |
22:34
๐
|
Nemo_bis |
where someone is dear al.ard |
22:53
๐
|
namespace |
omf_: What program are you putting WARC into? |
22:53
๐
|
arrith1 |
i think httrack |
22:54
๐
|
arrith1 |
which interestingly is GPLv3. don't really see much GPLv3 |