[00:10] RichardG: https://github.com/kngenie/ias3upload [00:10] thanks [00:11] RichardG: err, seems people recommend https://github.com/kimmel/ias3upload [00:14] I maintain https://github.com/kimmel/ias3upload which has bug fixes [00:23] never really used IA before, you say I should send it to community media so it's a better mirror than dropbox? [00:23] or contact them to get a collection? [00:27] RichardG, how large in file size is all of Google Answers [00:29] 313 MB in 7z ultra [00:29] but should send it uncompressed [00:29] which is like 3.4 GB [00:29] 3.5 actually [00:29] yeah you don't need a collection for that. An item can hold 20-50gb [00:38] I hope it's ok I saved without any site requirements, Google Answers is all text only anyways [00:40] compressing the cdxs for upload [00:45] feel like I'm doing something wrong if it falls in the Community Texts section.. :\ [00:46] RichardG: you can ask for your own collection, or maybe there's a google collection [00:48] awesome, I'm stuck now! [00:48] Users only have access to upload to a few collections unless they are added to others [00:49] RichardG: stuck waiting for a reply to mail the IA? [00:49] aaand my browser froze [00:51] RichardG: if you need someone to run that crawler for the blogs, i can. if it's bogging down your computer [00:51] nah... I just need to know what to do [00:52] I have a collection in community media, with a few files uploaded until I panic-killed the browser [00:52] I wasn ever into archiving before :P [00:55] I'm holding up on continuing to upload the Google Answers stuff until I come up with a solution... [00:55] can't find a collection, only spam [00:55] whois RichardG [00:55] whoops :P [00:55] sorrty [00:56] sorry** [00:56] spam where? [00:57] nah, it might just be my inability to find a collection to fit the Google Answers stuff [00:57] hmm... one of the ArchiveTeam collections? [00:57] basically: I have a few files accidentally in the opensource media section, and am looking for a proper home [00:57] could be, if I could move my original submission [00:58] hm, I think SketchCow and underscor have access to move things on IA [00:58] tell them to move this [00:58] http://archive.org/details/google-answers-archive [00:59] if it goes through, I"ll continue uploading tomorrow since it's late and 1.6 GB [00:59] it probably belongs in this collection: http://archive.org/details/archiveteam [00:59] wish I could upload overnight but tight on power bill [00:59] yeah [01:03] well, I'll go to sleep now, if that could be moved I would appreciate, then I will continue uploading the archive, thanks for the help [01:29] It's going to be hard to save all of the Yahoo! Answers data when(not if) Yahoo! decides to shut it down. [01:30] arkhive: preemptive action, especially crawls, for big sites is good [02:19] Anyone a system administrator at an ISP? [02:21] why do you ask, omf_? [02:25] Cause it would be nice to have access to the zone files of well used DNS servers [02:25] i might have something like that in a few months [02:26] i'll make a note and get back to you [02:26] We always seem to be looking for domain names and subdomain names [02:27] I may, keyword may, know someone who has access to that sort of thing, but this is no guarantee [02:29] damn he's not online [02:29] And having someone at an ISP would give us the access level we need to simplify most of these problems. Of course I could set up a DNS server and do it myself but ISP's servers get heavy use so the data is already there [02:58] omf_: Makes sense. 
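Side note on the upload mechanics discussed above: ias3upload is a thin wrapper around archive.org's S3-like API, so the Google Answers batches could also be pushed with plain curl. A rough sketch, assuming keys from archive.org/account/s3.php; the batch filename and metadata headers are illustrative placeholders, not the item's actual settings:

    ACCESS=your_ia_access_key
    SECRET=your_ia_secret_key
    curl --location \
         --header "authorization: LOW ${ACCESS}:${SECRET}" \
         --header "x-archive-meta-mediatype:texts" \
         --header "x-archive-meta-title:Google Answers archive" \
         --upload-file google-answers-part-001.warc.gz \
         "https://s3.us.archive.org/google-answers-archive/google-answers-part-001.warc.gz"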
[03:00] I used to work at an ISP so I know the data is there and easier to collect as an ISP [03:00] is this like query statistics or what [03:02] no. It is about discovering domain names without having to crawl sites looking for them [03:03] A system admin at a University would have the access needed as well [03:03] Aranje: Are you around when Archive Team decides to grab a site? [03:04] If you are then you'd know that the first thing they need to figure out is what to grab. [03:04] Most big sites are a mess of domains and subdomains. [03:04] crawling takes up a lot of valuable time [03:04] Being able to decide to archive a site and have it Just Work (TM) would be a real help. [03:07] http://en.wikipedia.org/wiki/Zone_file [03:11] so you're looking to be able to grab a zone file of the site to see if the subdomains are listed out? If yes, why would an ISP have access to that... doesn't only the site hosting have all of that info in one place? [03:12] this concept somewhat defies my (weak) knowledge of how dns is done [03:13] ... [03:13] Aranje: Think about this for a minute. [03:13] When you type in www.google.com into your browser, where does that request go to? [03:14] my computer [03:14] ... [03:14] Oh dear. [03:14] because I have unbound running in caching mode [03:14] Oh. [03:15] Well after that. [03:15] unbound asks... I forget one of the japanese ISP's that has a resolver in san jose [03:15] who, if it doesn't have a copy of the record goes and recurses down from . to get it [03:16] .? [03:16] the root [03:16] Oh. [03:16] Nevermind. [03:16] I'm not retarded, I just haven't eaten recently [03:16] lol [03:16] I thought I had this in my head, but now I'm confused. [03:17] I also run the dns shit for our hosting company, which is why I was interested in the first place [03:17] likely I can't help, but if I can get a handle on exactly what is needed maybe I know someone who can [03:19] An ISP can request the zone file for all .org sites for example from companies like godaddy and verisign [03:19] for local caching, I see [03:19] that is not something normal people can do because it takes resources to generation that file [03:20] I have tried filling out the paperwork to make it happen [03:20] They always say the free service is for ISPs not normal people [03:20] * Aranje nods [03:20] * Aranje understands now [03:20] Lets say I got the TLD .org zone file, I would not have 3+ million domain names [03:21] based on previous published numbers [03:22] There is no way to get all domain names everywhere but it is possible to get blocks of them measured in the millions [03:23] just have to put together 'omf ISP LLC' ;) [03:24] that'd be a fun project [03:24] We just need to ask around. There are 170 peeps in here and in that network of knowing people is probably a person who can help us. [03:24] sucking in a list of all domain names [03:25] Also someone who works at a big company and runs a DNS cache could find us sites as well [03:25] * Aranje nods [03:26] ISPs are best, universities usually get that access too, then large corps with many users would be next [03:27] There are places we could buy this information but fuck that [03:27] oooohh another idea. Someone who works at a hosting company [03:27] Amazon, Joyent, Digital Ocean, Linode, etc... 
they all provide multi-level DNS and caching [03:28] riverbed was the company I was thinking of [03:28] I have a friend there [03:29] Aranje, let me ask you about unbound [03:29] sure, I'll answer if I can [03:29] I was thinking having that running on a crawling server could really help speed up large scale grabs [03:30] Do you see speedups using it locally [03:30] the short answer is yes [03:31] I use it because I have charter as an ISP and I can't trust that dns requests will actually succeed, even to other providers [03:31] so I cache as much locally as possible [03:31] and there's no caching like local caching, for perf [03:31] who do point to for peering? opendns? [03:32] lemme figure it out, it's a japanese isp [03:32] they have an IP in san jose with very very good numbers [03:32] for people following along I recommend reading the short info on passive DNS here - https://security.isc.org/ [03:33] ahh [03:33] it's ntt [03:33] 129.250.35.250 is the IP I use [03:34] officially x.ns.gin.ntt.net [03:34] I'm halfway between LA and SF (in San Luis Obispo) and I've had great luck with stuff in SJC [03:34] Opera has done a lot of working on building domain lists [03:35] there's a dns performance test utility that I ran, and they got not the best speeds but the most stable [03:36] I used to run unbound in recursive caching mode, but I found switching to querying someone else gave me another drop in query time [03:37] If you have an archive box someplace, it might be fine to just find a dns server in the same datacenter. If that's not available, unbound is a great option. [03:37] omf_: So what you're saying is that you just need a copy of the file? [03:37] Copies of TLD zones files are the best solution since they have "everything" [03:38] I just found another service that will offer it but only to researchers and shit [03:38] If someone from IA applied they could probably get access. underscor [03:38] you can tell unbound to refetch both keys and full records ahead of their expiry as well, greatly reducing having to wait on queries if they're near ttl expiry [03:38] omf_: I was going to say, just ask anyway. [03:38] Godaddy/etc have a cheap ass profit motive not to. [03:39] The free for researchers guys might actually just give it to you if you ask. [03:39] This is who I was just referring to https://dnsdb.isc.org/ [03:40] I could ask around on IRC if you want. [03:40] please [03:43] I already have a few million domain names collected. I plan to build this list up and put it on IA [03:43] Most of the URL lists we put up already for sites that closed [03:44] omf_:) if you continue being interested, I can give you my unbound config and the list of caveats for it :) [03:44] Wait, a large should-be-public dataset that you plan to leak for the benefit of the public? [03:44] please Aranje I was looking to bundle unbound into a general crawler VM [03:44] yep [03:44] oh, perfect [03:44] You can probably find somebody willing to help with that sort of marketing. [03:45] I'll remove some of the caveats and pass it over. How much ram will the VM have? [03:45] Even set up an anonymous dump site. [03:45] ooh I should email Malamund [03:45] dnsleaks.org [03:45] :3 [03:45] for testing purposes 1gb [03:45] but I test it on butt providers with 8gb ram [03:47] okay, I'll tune some of the numbers down a bit. I think the way I have it set up it can use up to 256mbish of ram [03:47] but I cache for the house [03:47] it also never approaches that [03:47] :D [03:48] Does Archive Team have a blog? 
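A minimal sketch of the kind of unbound setup being discussed here: a local forwarding cache pointed at NTT's resolver, with prefetch so hot records are refreshed before their TTL expires. The cache sizes and thread count below are assumptions scaled for a small crawler VM, not the config Aranje later shares:

    cat > unbound-crawler.conf <<'EOF'
    server:
        interface: 127.0.0.1
        num-threads: 2            # roughly one per core, as discussed
        msg-cache-size: 64m
        rrset-cache-size: 128m
        prefetch: yes             # refresh popular records before TTL expiry
        prefetch-key: yes         # likewise for DNSKEYs (DNSSEC)

    forward-zone:
        name: "."
        forward-addr: 129.250.35.250    # x.ns.gin.ntt.net, from the chat
    EOF
    unbound-checkconf unbound-crawler.conf    # sanity-check before deploying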
[03:48] malware domain lists are already public - http://www.malwaredomainlist.com/mdl.php [03:48] This sort of thing is why it's a good idea to have one. [03:48] Even with a thousand subscribers you'd probably have what you wanted in a day or two if you asked on it. [03:48] we have jason and well followed twitters [03:49] Eh, I wouldn't want to bother Jason unless it was really necessary. [03:50] #jengaforxanga [03:50] * Aranje makes assumptions about vm's and tunes accordingly [03:50] xanga and google reader would both benefit from this work [03:52] these vm's... are they debian or ubuntu? [03:52] neither [03:52] m [03:53] homebrew? or just a centos or something [03:53] nope [03:53] * Aranje wishes to tailor his config to the package likely to be installed [03:53] omf_: Just say it so I don't have to scan my VM. [03:54] should I prepare a full zip with all the necessary files? [03:54] so it runs in its own directory and gives zero fucks? [03:54] * Aranje grins [03:54] Since most butt providers are dumb I go with the newest Linux they offer. This is usually Fedora but I prefer opensuse since it stays up to date and more stable than most of everything else [03:54] why would it matter [03:54] butt providers? [03:55] the location of eg: root.hints and root.key changes [03:55] for dnssec validation [03:55] check this out namespace - https://github.com/panicsteve/cloud-to-butt it is a running joe [03:55] joke [03:55] I can handle that Aranje [03:56] That's great. [03:56] we also call them clown hosting [03:56] Because they're not even funny/ [03:56] *? [03:56] because you are a clown for using one [03:56] Ah. [03:57] I feel sorry for clows. [03:57] *clowns [03:57] because people think "cloud hosting" solves all their problems [03:57] Pop culture destroyed that occupation. [03:57] I found 3 studies on 1+ million domains and no source data provided. [03:58] expected singlecore vm's? [03:58] multiple cores [03:58] 2 seem like a sane default? [03:59] yes [03:59] (governs number of threads) [04:15] omf_:) do you want them looking up the addresses themselves or how I do it with basically being a caching proxy [04:17] caching [04:18] anything that makes speed the priority [04:19] just got 335,902 more domains [04:19] yep. I've run back over my config (found some problems with the one I wrote Ha!) and reread the whole man page while doing so. This one'll have you covered. [04:19] I am writing a crawler right now to collect domain lists from sale sites [04:19] thanks [04:25] Archive Team measures domain name collection in hundreds of thousands, anything less would be uncivilized :) [04:28] i'm making a update grab of thefeed on g4tv.com [04:28] I am going to put a crawler together for godaddy. I can get 500 domain names at a time [04:29] look what i have found: http://www.telnetbbsguide.com/dialbbs/dialbbs.htm [04:30] godane: Most telnet BBS's are empty. [04:30] I think I just found a list of all active domains in 2011 [04:30] omf_: if you target the Warrior could probably get it very fast [04:32] jackpot bitches [04:33] I just got 90 million unique domain names [04:33] 90,000,000 <- that is a lot of zeros [04:34] and that is just the .com list [04:35] 14 million .org [04:40] So now I have a nice big clean dataset to share [04:45] this is going to take a while to download [05:00] Omf_: aranje lost power [05:02] oh [05:06] also got .au, .ru, .net, .info, .us, .ca, .de and others [05:21] omf_: gj [05:26] Ugh, that moment when you've been grabbing for like a day and realize all your grabs are contentless. 
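The size-based sanity check sketched out in the next few messages could be as small as this; the grab directory, the 10 KB threshold, and the log file are assumptions, and "touch STOP" follows the stop-file convention mentioned below:

    GRAB_DIR=./grab
    MIN_BYTES=10240                      # "files probably aren't right" below this
    SMALL=$(find "$GRAB_DIR" -type f -size -${MIN_BYTES}c)
    if [ -n "$SMALL" ]; then
        printf 'suspiciously small files:\n%s\n' "$SMALL" >> grab-warnings.log
        touch STOP                       # let a STOP-aware runner wind down
    fi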
[05:27] namespace: should really be a tool for that [05:28] could use loose ml, something like 'these are good files, if later files differ significantly, notify me somehow' [05:30] Yeah. [05:30] We need to code one up. [05:30] That was really frustrating. [05:31] I'm just glad I checked before I'd grabbed everything. [05:31] That would have been such a waste. [05:31] maybe even have an option to be somewhat specific like "notify me if 2% differs, 5%, etc' [05:31] Eh. [05:31] You could make it even simpler. [05:31] namespace: i'm actually wondering that now. Google Reader hasn't really ratelimited and is honest about http codes but it's too much to check manually in time [05:31] If the files start coming out sub a certain amount of memory let me now. [05:32] *know [05:32] namespace: oh like smaller than some byte size/count [05:32] Like 9KB files probably aren't right. [05:32] "less than 4 KB, 2 KB" [05:32] ah, yeah, less than 1 MB or 0.5 MB even sometimes [05:33] Mine are less than a megabyte but probably more than ten kilobytes. [05:33] that wouldn't even take ml, just a periodic thing and a notification system [05:33] Yeah, that's what I'm saying. [05:33] It could be a bash script. [05:34] how to notify though is what i'm wondering. since people run things headless a lot [05:34] Though python or some such would be better for the sake of not having to deal with bash. [05:34] Email? [05:34] System beep? [05:34] bash gets the job done :P [05:34] maybe writing out an error message to a file and "touch STOP" ing [05:37] or killing the process, if the process doesn't support "touch STOP" [05:38] could have an analysis thing where if you have a few known good files, or a directory of known good files, scans them and suggests rounded values to use [05:39] should use inotify on linux [05:54] alright on my TODO list to write that in python. but no one feel like they shouldn't write one if they want to [08:14] Ugh. Page grabs work when I use wget, but a full grab gets me contentless crap. [08:14] *use wget on a single page [08:16] namespace: some kind of ratelimiting? [08:16] arrith1: Maybe. [08:16] It's a vbulletin forum. [08:17] I'm waiting ten seconds between items. [08:17] well test out the URLs in the browser, also could be sending diff stuff depending on user agent, could try the fx addon user agent switcher [08:17] Yeah, I was gonna try to use a different user agent. [08:17] i know lots of forums send different data to user agent claiming to be crawlers, like will require registration unless the ua is googlebot [08:18] I am registered. [08:18] And a long-time member, for that matter. [08:18] namespace: is your wget supplying cookies? [08:18] arrith1: Yup. [08:19] And yes they're still valid. [08:19] welll, could just be a poorly coded user-agent hack thing. i'd use your browser UA and/or try fx user agent addon switcher [08:19] namespace: changed your browser ua? [08:19] Hmm. [08:19] I could try changing my browser UA to test. [08:19] what exactly is the contentless crap? [08:20] Hi @ all [08:20] Tunewiki is closing... [08:20] http://www.tunewiki.com/news/186/tunewiki-is-shutting-down [08:22] ivan`: A message you get when you look at the sites index.html [08:23] and is there a problem with xanga? Stopped getting new items... [08:24] SilSte: might be. a few users are reporting issues. 
xanga discussion in #jenga [08:24] SilSte: could make a page on the ArchiveTeam wiki for tunewiki if you want [08:24] I'm unsure if its important enought ^^ [08:24] namespace: hm so sometimes you're able to get pages from wget, but not successively [08:24] And there are only 4 days left... [08:25] SilSte: only thing that makes a site important enough is that people want to save it [08:25] hmm [08:25] ^ THis [08:26] I think my options were: [08:27] ./wget --warc-file --no-parent --mirror -w 10 --limit-rate 56k --verbose --load-cookies [08:27] (URL's redacted for privacy reasons.) [08:30] namespace: -erobots=off, also if you don't specify a ua then it does wget. i'd put money on vbulletin shipping with something to handle wget UAs [08:30] The robots.txt has nothing in it, basically. [08:31] Except for a 2 minute crawl limit. [08:31] Or something like that. [08:31] wikisecretword? [08:31] namespace: could set your delay time to 2min and set your UA to a normal firefox/chrome UA [08:31] SilSte: yahoosucks [08:31] SilSte: you didn't say the line btw :P [08:31] thx [08:31] arrith1: I'll try that. [08:32] How do I get a firefox UA? [08:32] google for "what is my UA" and pages will show you [08:32] I know how to switch it of course. [08:32] Ah, okay. [08:32] so you can use your actual browser UA [08:33] also like livehttpheaders fx plugin has that info, wireshark would also show it. might be some about:config thing that says it. also copying/pasting from firefox useragent switcher addon, but using your own is more stealthy if you visit a site a lot [08:36] At 2 mins a page this should take a few months. :P [08:37] namespace: hmm, well one upside is you probably would be totally safe from triggering any ratelimiting heh [08:37] I'm more afraid that the connection would time out before I finish. [08:38] Remember, WARC has no timestamping. [08:38] namespace: might want to save wget logs then [08:38] namespace: arrith: what are you doing? [08:38] namespace: btw you can do retries, if it times out then it'll retry [08:39] SilSte: namespace is saving some forums he likes a lot, and i'm working on ways to get RSS/atom feed urls for the Google Reader effort. right now that means a crawler to get usernames from livejournal [08:40] kk [08:40] qw3rty+P3R50N4L [08:40] hoshit [08:41] I made a Wikisite [08:41] http://www.archiveteam.org/index.php?title=TuneWiki [08:42] I'm not familiar with building the warrior etc... just wanted to inform ^^ [08:42] arrith1: Yeah, two minutes isn't happening. [08:42] haha [08:43] Sadly it's not doable on a sub 1200 baud modem. [08:44] I think the downloading of tunewiki should be kind of easy... there are list with the artists ;-) [08:44] namespace: well you could try without a limit, if all it cares about is the UA. if you aren't downloading it a bunch then one quick dl shouldn't be too noticeable [08:44] This is my third attempt. [08:44] SilSte: would you like to download tunewiki? with a single wget command you probably could grab it all [08:44] I think it might be a little noticeable. [08:44] if you tell me how... i can try [08:45] SilSte: do you have access to a terminal on a linux machine or VM? [08:45] i can install a ubuntu vm [08:45] Okay. [08:45] Here's the manual. [08:45] SilSte: sure, any linux you're familiar with [08:45] https://www.gnu.org/software/wget/manual/wget.html [08:46] It's super boring, but it'll tell you everything you'd want to know. [08:46] (Assuming you sort of know how a web server works.) 
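Putting the suggestions above together with the options namespace listed, a hedged version of that forum grab might look like the following; the URL, cookie file, WARC name, and the Firefox UA string are placeholders:

    wget --warc-file=forum-grab \
         --mirror --no-parent \
         -w 10 \
         -e robots=off \
         --load-cookies cookies.txt \
         -U "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0" \
         "http://forum.example.com/"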
[08:46] namespace: if you do "--warc-file" and don't specify anything, does it make up its own name? [08:46] arrith1: I don't think so. [08:46] I never tried it. [08:47] I just redacted the file name too, for the same reasons. [08:47] i think that would be neat. just take whatever wget is calling the file and append "warc.gz" [08:47] ah [08:47] SilSte: make sure the virtual hard drive of the VM has enough space to save a big site [08:48] SilSte: could do some 500 GB, since the VM won't take up all the space until it needs it [08:48] arrith1: is ubuntu good or would yo prefer debian? [08:48] Seems to be working. [08:48] SilSte: whichever you're more familiar with [08:48] namespace: what delay? [08:51] 10 seconds [08:52] I think the user agent change fixed it. [08:52] I also turned off the rate limit. [08:52] namespace: could try without the limit >:) [08:52] heh [08:52] Because bandwidth isn't the bottleneck. [08:52] if the site doesn't care, and can spare the bw [08:52] yeah [08:52] that's always nice. google reader downloading has been like that [08:53] The bottleneck is the wait time, which I have set to ten because it's not like the site is going anywhere. [08:58] SilSte: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output [08:58] SilSte: http://pad.archivingyoursh.it/p/wget-warc [09:04] SilSte: need to compile wget 1.14 (apt-get install build-essential openssl libssl-dev; tar xvf wget.tgz; cd wget; ./configure --with-ssl=openssl; make) then use the binary /src/wget, and can use this as a template: wget --warc-file=tunewiki --no-parent --mirror -w 2 -U "Wget/1.14 gzip ArchiveTeam" https://www.tunewiki.com [09:04] arrith its still installing ;-) [09:05] okay [09:05] thx [09:05] SilSte: alright. hardest part is compiling wget probably. you might get slowed/down blocked depending on how the site responds. if the site seems to be not limiting you, you can take off the "-w 2" [09:06] ok [09:06] how will I know that they are limiting? [09:07] SilSte: if you have terminal questions you can ask in #ubuntu on the Freenode irc network [09:07] SilSte: the command will probably error out. oh yeah, you might want to put in a retries thing: --tries=10 [09:09] SilSte: generally trial and error, you can ask in here or in #ubuntu or #debian on Freenode. one thing to keep in mind is if a binary isn't in your PATH then you need to specify the full path to it, so /home/user/wget_build/wget-1.14/src/wget --mirror [09:09] SilSte: another channel to be in on this network is #archiveteam-bs [09:09] ok [09:11] thats here all the crap talk is. [09:13] Yeah, it's working now. [09:25] arrith1: Wget is already installed in 1.14. Is it ok then? [09:25] Yeah. [09:25] yah should be. [09:26] If it supports --warc-file it's good. [09:26] ok [09:26] SilSte: do "man wget" and look for warc [09:26] That's the new option they added last year. [09:26] SilSte: or wget --help [09:27] looks good [09:29] SilSte: good to hear!, that's the benefit of installing a newer version of a distro i guess [09:29] installed 13.04 server [09:29] SilSte: you should be good to go, just make a dir, cd in, and try some wget stuff. that one i said should work as-is, but you can add what you want [09:29] but it looks a little bit slow... [09:29] SilSte: the download or the vm? 
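The compile-and-check steps from that one-liner, spelled out; the tarball URL assumes the usual GNU mirror layout, and none of this is needed if the distro's wget is already 1.14 or newer:

    sudo apt-get install build-essential openssl libssl-dev
    wget http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xvf wget-1.14.tar.gz
    cd wget-1.14
    ./configure --with-ssl=openssl
    make
    ./src/wget --version | head -n 1     # should report 1.14
    ./src/wget --help | grep -i warc     # confirm the WARC options exist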
[09:30] download [09:30] SilSte: btw i hope your hdd is huge [09:30] it makes about one step per sec [09:30] SilSte: you can remove the "-w 2" so it doesn't wait [09:30] i made a 600GB file [09:30] good :) [09:31] will this be fast enough with that around 1 item per sec? [09:31] "Wget/1.14 gzip" is a special string for Google, wget does not actually support gzip [09:31] it made about 2MB right now... [09:32] ivan`: ah, i meant to have the "ArchiveTeam" in there and copypasted heh [09:32] and is it possible to pause and restart wget? Or will it start over? [09:32] SilSte: that's a thing that limits how fast wget goes [09:34] okay ... so i can try without... if I need to stop. Does it restart? [09:35] SilSte: i don't know for sure, but i think it would be fine. to stop and resume, the stuff --mirror turns on should be fairly comprehensive [09:35] SilSte: probably best to leave it on, and let it go as fast as the site will let it [09:35] SilSte: ctrl-c to stop [09:36] arrith1: "probably best to leave it on, and let it go as fast as the site will let it" with or without -w 2? Atm its on [09:36] I'm getting some 404s... is that okay? [09:38] SilSte: yeah. could try without the -w [09:41] now its running ^^ [09:42] as long as its returning 200s everything should be fine I think [09:42] SilSte: yep [09:42] if they blacklist me... it will just stop? [09:42] SilSte: might want to keep the directory you use fairly clean, so for each attempt could make a new directory [09:43] SilSte: that's one way to blacklist. another is to show data that isn't the real data from the site (garbage data), or slowing it down [09:43] SilSte: periodically it would be good to check the data you're getting [09:43] how can i do this? [09:43] ahh [09:43] found the folder [09:43] SilSte: you can setup a shared folder between your host and guest and copy files from the guest to the host to view them [09:44] SilSte: yeah, something like ~/archiveteam/tunewiki/attempt1 then attempt2, attempt3, etc [09:44] y different attemps? ^^ [09:45] SilSte: sometimes you want to start over or start fresh [09:45] hmmm k [09:45] maybe if one attempt was going wrong for some reason [09:45] i will run it now and watch back later ;-) [09:46] good luck :) [09:48] I'm getting some "is not a directory"- errors [09:49] after GR is down, spidering feedburner would be good. means i need to find solid crawling software. gnu parallel's example page has an interesting section on wget as a parallel crawler [09:49] SilSte: hm odd, google errors you're curious about [09:49] kk [09:49] thx for your help <3 [11:06] Why not disallow all log/ dir now that it exists? https://catalogd.us.archive.org/robots.txt [11:23] SketchCow: thanks for that retweet [11:29] I'm sure if these are complete :S https://ia601803.us.archive.org/zipview.php?zip=/12/items/ftp-ftp.hp.com_ftp1/graham.zip https://ia601803.us.archive.org/zipview.php?zip=/12/items/ftp-ftp.hp.com_ftp1/catia.zip [12:46] Ok, anyone alive to assist with me trying to setup this EC2 stuff? [12:46] I can't even ssh in even though the rules seem ok D: [12:47] Oh ffs, fixed that. [12:58] Ok - next person who can help me getting this EC2 instance up and running please let me know. I have ubuntu server 13. something, wget 1.14, tornado, the seesaw stuff, all done. [13:45] test [13:45] okay [13:50] gnah... wget does not continue the warc file -.- [13:51] you can make a new warc and cat it onto the old one [13:52] No problem. [13:52] I'm packing up soon to drive south [13:53] Any sources about xanga.com dying? 
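For reference, a rough sketch of the "seesaw stuff" setup being described on that EC2 box; the project repository name is a placeholder and each project's README is the authoritative source:

    sudo apt-get install git python-pip build-essential
    sudo pip install seesaw                  # the ArchiveTeam seesaw kit
    git clone https://github.com/ArchiveTeam/example-project-grab
    cd example-project-grab
    run-pipeline pipeline.py YOURNICKNAME    # later, "touch STOP" to finish cleanly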
[13:53] I don't mind doing some CPU cycles/traffic for ArchiveTeam, but is it worth it? :-) [13:53] - if they dont dy :P [13:53] die [13:54] ivan`: how? [13:55] atm my comman is "wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com [13:56] PepsiMax: either you do it, or you don't. We aren't here to convince you, but it's going. [13:56] If you wanted to find it, you simpl;y need to visit the damn site itself. [13:57] So, XANGA.COM's Xanga Team log posted this entry a while ago: [13:58] http://thexangateam.xanga.com/773587240/relaunching-xanga-a-fundraiser/ [13:58] * May 30th: We launch this fundraiser, and continue our work building a WordPress version of Xanga. [13:59] * Through July 15th: We will contact our registered members to let them know about the fundraiser, and also allow any and all users to download their blogs and media files for free. [13:59] * July 15th: This will be the final day for the fundraiser. [13:59] If we have a successful fundraiser: [13:59] * July 15th: If we've raise $60k, then we will move over to the new WordPress version on this date. [13:59] If the fundraiser isn't successful: [13:59] * July 15th: If we haven't raised $60k, then this will be the last date that Xanga is up and running. [13:59] ... [13:59] So, that means that either 1. They're going to delete everything, or 2. They're going to utterly move everything to a new platform, which leads to lost formatting, items, and who knows what else. [14:05] http://thexangateam.xanga.com/774035061/update-on-relaunch-xanga-fundraiser-and-xanga-archives-news/ is a follow up post. They indicate how to download the blogs if you want to, but again they are not clear of how the blogs would change. [14:23] yeah, and even if it's 2), they're going to a paid-account model, which means that all the old free users would probably lose all of their shit [14:23] well, anyone that didn't sign up for the new paid service would lose their shit [14:24] they're a bit unclear on that count, but that's what i've extrapolated from what they have *not* said [14:28] Right. This is all a mountain of uncertainty, making a backup worth doing. [14:28] And now PepsiMax got the learns [14:29] if I need to stop my wget... what shall i do to prevent a total restart? [14:30] is --no-clobber what you are looking for? [14:30] eheeerrrghh what? [14:30] ec2-bundle-vol has created loaaaaaaads of files [14:31] image.part.{00..57} [14:31] and a 10Gb image file too [14:32] winr4r: don't know... but wget stopped downloading... I stopped it, startet it again and now the WARC File begins from the beginning.... [14:32] SketchCow: http://pphilip.xanga.com/774075894/your-blog-is-not-useless/ [14:32] sounds a bit good to be true :P [14:36] errrr [14:36] what are you trying to prove PepsiMax ? [14:38] "The unfortunate thing about the xanga archives is that the html is hardcoded to link to images on the xanga servers - which will no longer be there. So you will have the text of your blogs - and comments - but you will not easily be able to find what pictures go with each blog entry after the xanga servers go down." [14:38] Fail. [14:41] SilSteStr: the --continue/-c option might do what you want, don't know how that plays with WARC though [14:44] winr4r: its not working with WARC... [14:44] :< [14:45] continue does not work with warc [14:46] so do I really have to start over each time? Oo [14:47] hmmmm kind of [14:47] you should log the finished urls and then exclude them. 
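Since --continue can't extend a WARC, the workaround suggested here is to restart with a fresh --warc-file name and concatenate afterwards. Gzipped WARCs are written as one gzip member per record, so plain cat yields a valid combined file; names below match the tunewiki run:

    wget --warc-file=tunewiki-part2 --mirror -e robots=off \
         -U "Wget/1.14 ArchiveTeam" https://www.tunewiki.com
    cat tunewiki-part2.warc.gz >> tunewiki.warc.gz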
[14:47] :D [14:48] Oo [14:48] I'm not really familiar... [14:50] me nither, but it'd work I think :D [14:51] * omf_ poke ivan` [14:51] lol [14:52] the problem with Smiley's idea is that wget limits how many urls you can skip because it is junk [14:52] Doh! [14:56] omf_:) sorry, power outage. dunno if sue said anything, I asked him to. [14:56] I got the message A [14:57] kk. <3 sue [14:57] no worries [14:57] I am currently sucking down 150 million domain names [14:58] sup [14:59] I got 150 million unique domain names ivan [14:59] great [14:59] can I grab them yet? [14:59] I am still downloading the lists [14:59] alright [15:00] They are broken into blocks of 5000 for easy management [15:00] I also got all the urls from dmoz and 350,000 from a domain sale site [15:02] for ameblo.jp,blog.livedoor.jp,feeds.feedburner.com,feeds2.feedburner.com,feeds.rapidfeeds.com,blog.roodo.com it would be super-great if you could get the thing after the first slash [15:02] groups.yahoo.com/group/,groups.google.com/group/,www.wretch.cc/blog/ second slash [15:03] youtube.com/user/ second slash but I don't know if I'll get to those, seems kind of low value anyway [15:03] omf_: what are you doing? [15:03] collecting domain names which I plan to release as a data set on IA [15:03] omf_: So what shall I use instead? [15:04] thanks omf_ [15:04] omf_: kk [15:04] The normal big lists are only 1 million domains total and there are only 2 of those lists public [15:05] basically someone could seriously start a search engine using this list [15:05] omf_:) http://pastebin.com/5y0aemPs [15:06] primary assumption: installed on each node, not centrally [15:07] that is correct Aranje [15:07] wonderful :) [15:07] * Aranje fixes local config based on changes [15:07] * Aranje grins [15:08] http://www.governmentattic.org/8docs/NSA-WasntAllMagic_2002.pdf [15:08] http://www.governmentattic.org/8docs/NSA-TrafficAnalysisMonograph_1993.pdf [15:08] someone go get em and submit to IA plz [15:09] * Smiley won't as he's going to get his train now. [15:09] got em [15:10] site is down again [15:10] too much HN/reddit traffic [15:13] my 22GB/2.4 billion commoncrawl set http://204.12.192.194:32047/common_crawl_index_urls.bz2 will be up for another week, I do not really know how/where to upload to IA [15:13] I can take care of that for you ivan` if you want. [15:14] I [15:14] 'll start a fetch for it onto anarchive [15:14] good idea GLaDOS [15:15] ivan`, how should I get this csv of domains to you? [15:16] omf_: a torrent would be most convenient but just about anything will work [15:16] how big is it? [15:16] this is just the 335k list [15:17] 12mb uncompressed [15:17] if it's <1GB http://allyourfeed.ludios.org:8080/ [15:17] heh [15:18] done [15:19] got it, thanks [15:20] I like this config better than the one I was using >_> [15:33] ivan`: are there thing s.o. can help with google? [15:39] SilSteStr: yes, we really need good query lists that will find more feeds using Reader's Feed Directory [15:39] n-grams, obscure topics, words in every language, etc [15:40] some of the sites listed on http://www.archiveteam.org/index.php?title=Google_Reader need to be spidered to find more users [15:40] I can put up a list of every query that's been imported into greader-directory-grab for inspiration [15:41] so i should run the "greader directory grab"? [15:41] sure [15:42] it does not do the tedious work of finding things to search for, however ;) [15:44] how may i help there? 
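A sketch of the blocks-of-5000 bookkeeping mentioned above, applied to a raw domain dump; file names are made up:

    sort -u raw_domains.txt > domains_unique.txt
    wc -l domains_unique.txt                      # count of unique names
    split -l 5000 -a 4 domains_unique.txt block_  # block_aaaa, block_aaab, ...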
[15:46] you can google for big lists of things, see also wikipedia's many lists, and make clean lists of queries [15:46] the queries get plugged into https://www.google.com/reader/view/#directory-page/1 - you can see if you get good results [15:47] ./o\ I need an google acc then :D [15:47] we also need 2-grams for all the languages, that is, word pairs [15:47] indeed [15:47] wantet me to log in ^^ [15:47] lets continue in the other channel ^^ [15:47] yep [16:46] I'm still getting those "is not a directory" failure with tunewiki :(. It's also telling: "Cannot write to XY" (Success) ... [16:47] http://snag.gy/EHokz.jpg (german sry) [16:47] any ideas? [16:47] wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com [16:47] is the command i used [20:00] SilSteStr: Why'd you turn off robots.txt? [20:00] Is their robots.txt stupid or? [20:00] read this somewhere :D [20:01] It could also be the User Agent. [20:02] Oh I think I get it. [20:02] You need to turn off --no-parent [20:02] Tunewiki here would be the root directory. [20:02] So there's no point in having it on and it might be messing it up. [20:02] hmmm [20:03] so... should i delete everything? [20:03] Wait, is it a wikia forum? [20:03] No, it's not. [20:03] Does "everything" have no data? [20:04] You can check with a browser. [20:05] no [20:05] only some.... [20:05] I wouldn't delete it if it's got data. [20:05] and i'm not sure if there is no data... [20:05] I just said check with a browser. [20:05] it just tells at some points that there is a "is not a directory" failure... [20:06] ... [20:06] But it keeps grabbing? [20:06] yes [20:06] i googled this... [20:06] one sec [20:07] i found this [20:07] http://superuser.com/questions/266112/mirroring-a-wordpress-site-with-wget [20:07] but I don't know how to fix it... [20:07] SilSteStr: I had that problem when there was a file name foo and then a directory named foo and wget tried to download to foo/bar, it couldn't create the directory foo since the file foo existed [20:07] if that makes sense [20:08] Tephra: I think this is the problem... [20:08] Man, archive teams combined knowledge could be used for some serious patches to wget. [20:09] SilSteStr: can you try with --no-clobber ? [20:09] Not being able to resolve a file name duplication is fail. [20:09] ok... I will try it in another folder... [20:10] I'm already at h... ;-) [20:10] namespace: yes and a serious pain in the ass, when you have been grabbing something for 1 h then start seeing that message [20:11] I'm also not really sure if everything is fine... after about 6 hours the warc file has only 150MB... [20:13] I get the failure [20:13] "Timestamp" and "Overwriting old files" is at the same time not impossible [20:14] (in German ^^, -> translated) [20:14] do you mean possible? [20:17] SilSteStr: try --force-directories maybe? [20:17] ähh [20:17] yes [20:17] okay [20:17] its running now [20:18] same problem again [20:18] seems like there's a bug filed: http://savannah.gnu.org/bugs/?29647 [20:19] in 2010 Oo [20:19] "www.tunewiki.com/lyrics/rihanna: Is not a directory www.tunewiki.com/lyrics/rihanna/diamons: Is not a directory" [20:20] "Cannot write to "www.tunewiki.com/lyrics/rihanna(diamons" (success). [20:20] wget moves slow [20:21] seems like a patch was made in 2012 [20:22] but will it work with 1.14? ^^ [20:23] dunno [20:25] I'll make a first run... after there is hopefully time for tweaking... 
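Going back to the 2-gram lists asked for above: a crude way to turn any text dump into ranked word-pair queries (ASCII only; real multilingual lists need more care, and corpus.txt is a placeholder):

    tr '[:upper:]' '[:lower:]' < corpus.txt \
      | tr -cs '[:alpha:]\n' ' ' \
      | awk '{for (i = 1; i < NF; i++) print $i, $(i+1)}' \
      | sort | uniq -c | sort -rn > bigram_queries.txt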
[20:29] maybe we should file a bug report and hopefully it gets fixed [20:31] SilSteStr: could you send me the complete command and url that you are trying? [20:32] wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com [20:32] thanks! [20:33] I'm uploading Google Answers while IA doesn't move it to ArchiveTeam - right now doing batch 390 of 787 [20:33] wget does not support gzip don't put that in the user agent [20:35] kk [20:36] but doesnt change anything... [20:37] I know, it's just a moral hazard to leave it in there [20:37] kk ^^ [20:38] one day I'll want some gzip data and that sucky user agent has spread all over the internet [20:38] if I sent a wget to background... is there a possibility to get it to the foreground? [20:38] SilSteStr: you can start it in screen, detach, attach [20:38] or tmux if you like that [20:39] and if its already started? ^^ [20:39] chose to log to a file [20:39] fg, maybe [20:40] not working... [20:40] looks like russian roulette then ;-) [20:43] hmm can't get it to work, looks like a genuine bug to me [20:44] kk [21:27] SilSteStr: should google: how to cat files [21:27] SilSteStr: that's how you cat warcs together [21:28] this works with warcs? [21:29] what about double files? [21:29] SilSteStr: like wget when a thing is a directory and file with the same name? [21:31] ? [21:40] How do I know if The Way Back Machine grabbed all of a site? www.xbdev.net/index.php [21:40] Is there a way to compare it automatically? [21:41] SilSteStr: what do you mean by "double files"? [21:42] arkhive: "if you have a video list in a file you can use ia-dirdiff to check the items on IA" [21:42] arrith1: that was said earlier in #archiveteam-bs, so ia-dirdiff might do what you want [21:43] arrith1: one way to know for sure is to wget it yourself then upload it to IA with warcs ;) [21:59] er [21:59] arkhive: [21:59] no idea what's going on [22:01] arrith1: I can't grab it myself atm [22:02] acknowledged [22:03] You could also check the CDX search for urls [22:10] did you think about a rasperry pi warrior? [22:16] I have a raspberry pi warrior [22:16] I made it a few months ago [22:17] does it work good? [22:18] I would like something autonomous to spread to family ;-) [22:18] already did this with tor [22:18] small little boxes, safe config [22:19] should work for warriors as well ;-) [22:20] probably would be fairly CPU-bound, but possibly optimize-able. at least rasbian is sort of the default, shouldn't be too hard to attempt porting of the warrior stuff from the vm [22:20] it is not cpu bound it is RAM limited [22:20] wget is a filthy pig [22:21] omf_: max ram i've heard of on the raspi is 512 MB, what sort of usage does the warrior's wget get to? [22:21] My plans are to do a few week test run doing url shorteners to see how it works out [22:21] wget uses more ram the more urls it collects [22:22] so the bigger the site, the more ram [22:23] well, could try to do some heuristics to keep the urls the raspi warrior loads up at once within RAM limits [22:24] which would require changing the mess of shit code known as wget [22:24] I am testing out a warc convertor for httrack since it already is smart about managing ram and concurrent connections [22:25] hmm maybe. i think wget-lua is fairly powerful though. 
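One way to do the CDX check suggested above: pull the list of URLs the Wayback Machine has under a prefix and diff it against your own list. The query parameters are from memory, so treat them as assumptions; my_crawl_urls.txt is a placeholder:

    curl -s 'http://web.archive.org/cdx/search/cdx?url=xbdev.net/*&fl=original&collapse=urlkey' \
      | sort -u > wayback_urls.txt
    comm -23 <(sort -u my_crawl_urls.txt) wayback_urls.txt   # URLs the Wayback Machine lacks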
[22:25] In terms of web scrapers wget is the drooling retard in the corner [22:25] heh, definitely [22:25] wget-lua is a hack to work around wget's design flaws [22:26] and it uses more ram since it has to run lua [22:29] we use wget because it has warc support built in. I am working on warc support for better applications [22:29] Nemo_bis: HP carries the service manuals for all their products on the product page- if you can easily extract those, it would be a great addition to the service manuals collection [22:29] just like someone took the time to build warc into wget. Evolution based on available developer time [22:34] where someone is dear al.ard [22:53] omf_: What program are you putting WARC into? [22:53] i think httrack [22:54] which interestingly is GPLv3. don't really see much GPLv3
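For context, a plain httrack mirror run of the sort such a WARC converter would sit on top of; httrack manages its own RAM and connection limits, which is the appeal mentioned above. Site and option values are placeholders:

    httrack "http://forum.example.com/" \
            -O ./forum-mirror \
            "+*.forum.example.com/*" \
            -c4 -v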