#archiveteam 2013-06-24,Mon


Time Nickname Message
00:10 ๐Ÿ”— arrith1 RichardG: https://github.com/kngenie/ias3upload
00:10 ๐Ÿ”— RichardG thanks
00:11 ๐Ÿ”— arrith1 RichardG: err, seems people recommend https://github.com/kimmel/ias3upload
00:14 ๐Ÿ”— omf_ I maintain https://github.com/kimmel/ias3upload which has bug fixes
00:23 ๐Ÿ”— RichardG never really used IA before, you say I should send it to community media so it's a better mirror than dropbox?
00:23 ๐Ÿ”— RichardG or contact them to get a collection?
00:27 ๐Ÿ”— omf_ RichardG, how large in file size is all of Google Answers
00:29 ๐Ÿ”— RichardG 313 MB in 7z ultra
00:29 ๐Ÿ”— RichardG but should send it uncompressed
00:29 ๐Ÿ”— RichardG which is like 3.4 GB
00:29 ๐Ÿ”— RichardG 3.5 actually
00:29 ๐Ÿ”— omf_ yeah you don't need a collection for that. An item can hold 20-50gb
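(For reference, an upload like the one being discussed can also be done directly against IA's S3-style API, which is the interface ias3upload wraps. A minimal sketch, assuming placeholder keys, item identifier and filename:

    ACCESS=YOUR_IA_ACCESS_KEY
    SECRET=YOUR_IA_SECRET_KEY
    curl --location \
         --header "authorization: LOW $ACCESS:$SECRET" \
         --header 'x-amz-auto-make-bucket:1' \
         --header 'x-archive-meta-mediatype:web' \
         --header 'x-archive-meta-title:Google Answers archive (example)' \
         --upload-file google-answers-001.warc.gz \
         http://s3.us.archive.org/google-answers-example/google-answers-001.warc.gz
)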
00:38 ๐Ÿ”— RichardG I hope it's ok I saved without any site requirements, Google Answers is all text only anyways
00:40 ๐Ÿ”— RichardG compressing the cdxs for upload
00:45 ๐Ÿ”— RichardG feel like I'm doing something wrong if it falls in the Community Texts section.. :\
00:46 ๐Ÿ”— arrith1 RichardG: you can ask for your own collection, or maybe there's a google collection
00:48 ๐Ÿ”— RichardG awesome, I'm stuck now!
00:48 ๐Ÿ”— omf_ Users only have access to upload to a few collections unless they are added to others
00:49 ๐Ÿ”— arrith1 RichardG: stuck waiting for a reply to mail the IA?
00:49 ๐Ÿ”— RichardG aaand my browser froze
00:51 ๐Ÿ”— arrith1 RichardG: if you need someone to run that crawler for the blogs, i can. if it's bogging down your computer
00:51 ๐Ÿ”— RichardG nah... I just need to know what to do
00:52 ๐Ÿ”— RichardG I have a collection in community media, with a few files uploaded until I panic-killed the browser
00:52 ๐Ÿ”— RichardG I wasn't ever into archiving before :P
00:55 ๐Ÿ”— RichardG I'm holding up on continuing to upload the Google Answers stuff until I come up with a solution...
00:55 ๐Ÿ”— RichardG can't find a collection, only spam
00:55 ๐Ÿ”— balrog whois RichardG
00:55 ๐Ÿ”— balrog whoops :P
00:55 ๐Ÿ”— balrog sorrty
00:56 ๐Ÿ”— balrog sorry**
00:56 ๐Ÿ”— balrog spam where?
00:57 ๐Ÿ”— RichardG nah, it might just be my inability to find a collection to fit the Google Answers stuff
00:57 ๐Ÿ”— balrog hmm... one of the ArchiveTeam collections?
00:57 ๐Ÿ”— RichardG basically: I have a few files accidentally in the opensource media section, and am looking for a proper home
00:57 ๐Ÿ”— RichardG could be, if I could move my original submission
00:58 ๐Ÿ”— balrog hm, I think SketchCow and underscor have access to move things on IA
00:58 ๐Ÿ”— RichardG tell them to move this
00:58 ๐Ÿ”— RichardG http://archive.org/details/google-answers-archive
00:59 ๐Ÿ”— RichardG if it goes through, I'll continue uploading tomorrow since it's late and 1.6 GB
00:59 ๐Ÿ”— balrog it probably belongs in this collection: http://archive.org/details/archiveteam
00:59 ๐Ÿ”— RichardG wish I could upload overnight but tight on power bill
00:59 ๐Ÿ”— RichardG yeah
01:03 ๐Ÿ”— RichardG well, I'll go to sleep now, if that could be moved I would appreciate, then I will continue uploading the archive, thanks for the help
01:29 ๐Ÿ”— arkhive It's going to be hard to save all of the Yahoo! Answers data when(not if) Yahoo! decides to shut it down.
01:30 ๐Ÿ”— arrith1 arkhive: preemptive action, especially crawls, for big sites is good
02:19 ๐Ÿ”— omf_ Anyone a system administrator at an ISP?
02:21 ๐Ÿ”— BlueMax why do you ask, omf_?
02:25 ๐Ÿ”— omf_ Cause it would be nice to have access to the zone files of well used DNS servers
02:25 ๐Ÿ”— arrith1 i might have something like that in a few months
02:26 ๐Ÿ”— arrith1 i'll make a note and get back to you
02:26 ๐Ÿ”— omf_ We always seem to be looking for domain names and subdomain names
02:27 ๐Ÿ”— BlueMax I may, keyword may, know someone who has access to that sort of thing, but this is no guarantee
02:29 ๐Ÿ”— BlueMax damn he's not online
02:29 ๐Ÿ”— omf_ And having someone at an ISP would give us the access level we need to simplify most of these problems. Of course I could set up a DNS server and do it myself but ISP's servers get heavy use so the data is already there
02:58 ๐Ÿ”— namespace omf_: Makes sense.
03:00 ๐Ÿ”— omf_ I used to work at an ISP so I know the data is there and easier to collect as an ISP
03:00 ๐Ÿ”— Aranje is this like query statistics or what
03:02 ๐Ÿ”— omf_ no. It is about discovering domain names without having to crawl sites looking for them
03:03 ๐Ÿ”— omf_ A system admin at a University would have the access needed as well
03:03 ๐Ÿ”— namespace Aranje: Are you around when Archive Team decides to grab a site?
03:04 ๐Ÿ”— namespace If you are then you'd know that the first thing they need to figure out is what to grab.
03:04 ๐Ÿ”— namespace Most big sites are a mess of domains and subdomains.
03:04 ๐Ÿ”— arrith1 crawling takes up a lot of valuable time
03:04 ๐Ÿ”— namespace Being able to decide to archive a site and have it Just Work (TM) would be a real help.
03:07 ๐Ÿ”— namespace http://en.wikipedia.org/wiki/Zone_file
03:11 ๐Ÿ”— Aranje so you're looking to be able to grab a zone file of the site to see if the subdomains are listed out? If yes, why would an ISP have access to that... doesn't only the site hosting have all of that info in one place?
03:12 ๐Ÿ”— Aranje this concept somewhat defies my (weak) knowledge of how dns is done
03:13 ๐Ÿ”— namespace ...
03:13 ๐Ÿ”— namespace Aranje: Think about this for a minute.
03:13 ๐Ÿ”— namespace When you type in www.google.com into your browser, where does that request go to?
03:14 ๐Ÿ”— Aranje my computer
03:14 ๐Ÿ”— namespace ...
03:14 ๐Ÿ”— namespace Oh dear.
03:14 ๐Ÿ”— Aranje because I have unbound running in caching mode
03:14 ๐Ÿ”— namespace Oh.
03:15 ๐Ÿ”— namespace Well after that.
03:15 ๐Ÿ”— Aranje unbound asks... I forget one of the japanese ISP's that has a resolver in san jose
03:15 ๐Ÿ”— Aranje who, if it doesn't have a copy of the record goes and recurses down from . to get it
03:16 ๐Ÿ”— namespace .?
03:16 ๐Ÿ”— Aranje the root
03:16 ๐Ÿ”— namespace Oh.
03:16 ๐Ÿ”— namespace Nevermind.
03:16 ๐Ÿ”— Aranje I'm not retarded, I just haven't eaten recently
03:16 ๐Ÿ”— Aranje lol
03:16 ๐Ÿ”— namespace I thought I had this in my head, but now I'm confused.
03:17 ๐Ÿ”— Aranje I also run the dns shit for our hosting company, which is why I was interested in the first place
03:17 ๐Ÿ”— Aranje likely I can't help, but if I can get a handle on exactly what is needed maybe I know someone who can
03:19 ๐Ÿ”— omf_ An ISP can request the zone file for all .org sites for example from companies like godaddy and verisign
03:19 ๐Ÿ”— Aranje for local caching, I see
03:19 ๐Ÿ”— omf_ that is not something normal people can do because it takes resources to generate that file
03:20 ๐Ÿ”— omf_ I have tried filling out the paperwork to make it happen
03:20 ๐Ÿ”— omf_ They always say the free service is for ISPs not normal people
03:20 ๐Ÿ”— * Aranje nods
03:20 ๐Ÿ”— * Aranje understands now
03:20 ๐Ÿ”— omf_ Let's say I got the TLD .org zone file, I would now have 3+ million domain names
03:21 ๐Ÿ”— omf_ based on previous published numbers
03:22 ๐Ÿ”— omf_ There is no way to get all domain names everywhere but it is possible to get blocks of them measured in the millions
03:23 ๐Ÿ”— arrith1 just have to put together 'omf ISP LLC' ;)
03:24 ๐Ÿ”— Aranje that'd be a fun project
03:24 ๐Ÿ”— omf_ We just need to ask around. There are 170 peeps in here and in that network of knowing people is probably a person who can help us.
03:24 ๐Ÿ”— Aranje sucking in a list of all domain names
03:25 ๐Ÿ”— omf_ Also someone who works at a big company and runs a DNS cache could find us sites as well
03:25 ๐Ÿ”— * Aranje nods
03:26 ๐Ÿ”— omf_ ISPs are best, universities usually get that access too, then large corps with many users would be next
03:27 ๐Ÿ”— omf_ There are places we could buy this information but fuck that
03:27 ๐Ÿ”— omf_ oooohh another idea. Someone who works at a hosting company
03:27 ๐Ÿ”— omf_ Amazon, Joyent, Digital Ocean, Linode, etc... they all provide multi-level DNS and caching
03:28 ๐Ÿ”— Aranje riverbed was the company I was thinking of
03:28 ๐Ÿ”— Aranje I have a friend there
03:29 ๐Ÿ”— omf_ Aranje, let me ask you about unbound
03:29 ๐Ÿ”— Aranje sure, I'll answer if I can
03:29 ๐Ÿ”— omf_ I was thinking having that running on a crawling server could really help speed up large scale grabs
03:30 ๐Ÿ”— omf_ Do you see speedups using it locally
03:30 ๐Ÿ”— Aranje the short answer is yes
03:31 ๐Ÿ”— Aranje I use it because I have charter as an ISP and I can't trust that dns requests will actually succeed, even to other providers
03:31 ๐Ÿ”— Aranje so I cache as much locally as possible
03:31 ๐Ÿ”— Aranje and there's no caching like local caching, for perf
03:31 ๐Ÿ”— omf_ who do you point to for peering? opendns?
03:32 ๐Ÿ”— Aranje lemme figure it out, it's a japanese isp
03:32 ๐Ÿ”— Aranje they have an IP in san jose with very very good numbers
03:32 ๐Ÿ”— omf_ for people following along I recommend reading the short info on passive DNS here - https://security.isc.org/
03:33 ๐Ÿ”— Aranje ahh
03:33 ๐Ÿ”— Aranje it's ntt
03:33 ๐Ÿ”— Aranje 129.250.35.250 is the IP I use
03:34 ๐Ÿ”— Aranje officially x.ns.gin.ntt.net
03:34 ๐Ÿ”— Aranje I'm halfway between LA and SF (in San Luis Obispo) and I've had great luck with stuff in SJC
03:34 ๐Ÿ”— omf_ Opera has done a lot of work on building domain lists
03:35 ๐Ÿ”— Aranje there's a dns performance test utility that I ran, and they got not the best speeds but the most stable
03:36 ๐Ÿ”— Aranje I used to run unbound in recursive caching mode, but I found switching to querying someone else gave me another drop in query time
03:37 ๐Ÿ”— Aranje If you have an archive box someplace, it might be fine to just find a dns server in the same datacenter. If that's not available, unbound is a great option.
03:37 ๐Ÿ”— namespace omf_: So what you're saying is that you just need a copy of the file?
03:37 ๐Ÿ”— omf_ Copies of TLD zones files are the best solution since they have "everything"
03:38 ๐Ÿ”— omf_ I just found another service that will offer it but only to researchers and shit
03:38 ๐Ÿ”— omf_ If someone from IA applied they could probably get access. underscor
03:38 ๐Ÿ”— Aranje you can tell unbound to refetch both keys and full records ahead of their expiry as well, greatly reducing having to wait on queries if they're near ttl expiry
03:38 ๐Ÿ”— namespace omf_: I was going to say, just ask anyway.
03:38 ๐Ÿ”— namespace Godaddy/etc have a cheap ass profit motive not to.
03:39 ๐Ÿ”— namespace The free for researchers guys might actually just give it to you if you ask.
03:39 ๐Ÿ”— omf_ This is who I was just referring to https://dnsdb.isc.org/
03:40 ๐Ÿ”— namespace I could ask around on IRC if you want.
03:40 ๐Ÿ”— omf_ please
03:43 ๐Ÿ”— omf_ I already have a few million domain names collected. I plan to build this list up and put it on IA
03:43 ๐Ÿ”— omf_ Most of the URL lists we already put up are for sites that closed
03:44 ๐Ÿ”— Aranje omf_:) if you continue being interested, I can give you my unbound config and the list of caveats for it :)
03:44 ๐Ÿ”— namespace Wait, a large should-be-public dataset that you plan to leak for the benefit of the public?
03:44 ๐Ÿ”— omf_ please Aranje I was looking to bundle unbound into a general crawler VM
03:44 ๐Ÿ”— omf_ yep
03:44 ๐Ÿ”— Aranje oh, perfect
03:44 ๐Ÿ”— namespace You can probably find somebody willing to help with that sort of marketing.
03:45 ๐Ÿ”— Aranje I'll remove some of the caveats and pass it over. How much ram will the VM have?
03:45 ๐Ÿ”— namespace Even set up an anonymous dump site.
03:45 ๐Ÿ”— omf_ ooh I should email Malamund
03:45 ๐Ÿ”— namespace dnsleaks.org
03:45 ๐Ÿ”— Aranje :3
03:45 ๐Ÿ”— omf_ for testing purposes 1gb
03:45 ๐Ÿ”— omf_ but I test it on butt providers with 8gb ram
03:47 ๐Ÿ”— Aranje okay, I'll tune some of the numbers down a bit. I think the way I have it set up it can use up to 256mbish of ram
03:47 ๐Ÿ”— Aranje but I cache for the house
03:47 ๐Ÿ”— Aranje it also never approaches that
03:47 ๐Ÿ”— Aranje :D
03:48 ๐Ÿ”— namespace Does Archive Team have a blog?
03:48 ๐Ÿ”— omf_ malware domain lists are already public - http://www.malwaredomainlist.com/mdl.php
03:48 ๐Ÿ”— namespace This sort of thing is why it's a good idea to have one.
03:48 ๐Ÿ”— namespace Even with a thousand subscribers you'd probably have what you wanted in a day or two if you asked on it.
03:48 ๐Ÿ”— Aranje we have jason and well followed twitters
03:49 ๐Ÿ”— namespace Eh, I wouldn't want to bother Jason unless it was really necessary.
03:50 ๐Ÿ”— namespace #jengaforxanga
03:50 ๐Ÿ”— * Aranje makes assumptions about vm's and tunes accordingly
03:50 ๐Ÿ”— omf_ xanga and google reader would both benefit from this work
03:52 ๐Ÿ”— Aranje these vm's... are they debian or ubuntu?
03:52 ๐Ÿ”— omf_ neither
03:52 ๐Ÿ”— Aranje m
03:53 ๐Ÿ”— Aranje homebrew? or just a centos or something
03:53 ๐Ÿ”— omf_ nope
03:53 ๐Ÿ”— * Aranje wishes to tailor his config to the package likely to be installed
03:53 ๐Ÿ”— namespace omf_: Just say it so I don't have to scan my VM.
03:54 ๐Ÿ”— Aranje should I prepare a full zip with all the necessary files?
03:54 ๐Ÿ”— Aranje so it runs in its own directory and gives zero fucks?
03:54 ๐Ÿ”— * Aranje grins
03:54 ๐Ÿ”— omf_ Since most butt providers are dumb I go with the newest Linux they offer. This is usually Fedora but I prefer opensuse since it stays up to date and more stable than most of everything else
03:54 ๐Ÿ”— omf_ why would it matter
03:54 ๐Ÿ”— namespace butt providers?
03:55 ๐Ÿ”— Aranje the location of eg: root.hints and root.key changes
03:55 ๐Ÿ”— Aranje for dnssec validation
03:55 ๐Ÿ”— omf_ check this out namespace - https://github.com/panicsteve/cloud-to-butt it is a running joe
03:55 ๐Ÿ”— omf_ joke
03:55 ๐Ÿ”— omf_ I can handle that Aranje
03:56 ๐Ÿ”— namespace That's great.
03:56 ๐Ÿ”— omf_ we also call them clown hosting
03:56 ๐Ÿ”— namespace Because they're not even funny/
03:56 ๐Ÿ”— namespace *?
03:56 ๐Ÿ”— omf_ because you are a clown for using one
03:56 ๐Ÿ”— namespace Ah.
03:57 ๐Ÿ”— namespace I feel sorry for clows.
03:57 ๐Ÿ”— namespace *clowns
03:57 ๐Ÿ”— omf_ because people think "cloud hosting" solves all their problems
03:57 ๐Ÿ”— namespace Pop culture destroyed that occupation.
03:57 ๐Ÿ”— omf_ I found 3 studies on 1+ million domains and no source data provided.
03:58 ๐Ÿ”— Aranje expected singlecore vm's?
03:58 ๐Ÿ”— omf_ multiple cores
03:58 ๐Ÿ”— Aranje 2 seem like a sane default?
03:59 ๐Ÿ”— omf_ yes
03:59 ๐Ÿ”— Aranje (governs number of threads)
04:15 ๐Ÿ”— Aranje omf_:) do you want them looking up the addresses themselves or how I do it with basically being a caching proxy
04:17 ๐Ÿ”— omf_ caching
04:18 ๐Ÿ”— omf_ anything that makes speed the priority
04:19 ๐Ÿ”— omf_ just got 335,902 more domains
04:19 ๐Ÿ”— Aranje yep. I've run back over my config (found some problems with the one I wrote Ha!) and reread the whole man page while doing so. This one'll have you covered.
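(For context, a rough sketch of the kind of unbound.conf under discussion: a local caching forwarder with two threads, modest cache sizes for a ~1 GB VM, prefetch of records and DNSSEC keys ahead of TTL expiry, and forwarding to the NTT resolver mentioned above. Aranje's actual config, pastebin'd later in the log, will differ; these values are illustrative only.

    # /etc/unbound/unbound.conf (illustrative values only)
    server:
        interface: 127.0.0.1
        access-control: 127.0.0.0/8 allow
        num-threads: 2
        msg-cache-size: 64m
        rrset-cache-size: 128m
        # refresh popular records and DNSSEC keys before their TTLs expire
        prefetch: yes
        prefetch-key: yes

    forward-zone:
        name: "."
        forward-addr: 129.250.35.250
)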
04:19 ๐Ÿ”— omf_ I am writing a crawler right now to collect domain lists from sale sites
04:19 ๐Ÿ”— omf_ thanks
04:25 ๐Ÿ”— omf_ Archive Team measures domain name collection in hundreds of thousands, anything less would be uncivilized :)
04:28 ๐Ÿ”— godane i'm making a update grab of thefeed on g4tv.com
04:28 ๐Ÿ”— omf_ I am going to put a crawler together for godaddy. I can get 500 domain names at a time
04:29 ๐Ÿ”— godane look what i have found: http://www.telnetbbsguide.com/dialbbs/dialbbs.htm
04:30 ๐Ÿ”— namespace godane: Most telnet BBS's are empty.
04:30 ๐Ÿ”— omf_ I think I just found a list of all active domains in 2011
04:30 ๐Ÿ”— arrith1 omf_: if you target it, the Warrior could probably get it very fast
04:32 ๐Ÿ”— omf_ jackpot bitches
04:33 ๐Ÿ”— omf_ I just got 90 million unique domain names
04:33 ๐Ÿ”— omf_ 90,000,000 <- that is a lot of zeros
04:34 ๐Ÿ”— omf_ and that is just the .com list
04:35 ๐Ÿ”— omf_ 14 million .org
04:40 ๐Ÿ”— omf_ So now I have a nice big clean dataset to share
04:45 ๐Ÿ”— omf_ this is going to take a while to download
05:00 ๐Ÿ”— Sue|phone Omf_: aranje lost power
05:02 ๐Ÿ”— omf_ oh
05:06 ๐Ÿ”— omf_ also got .au, .ru, .net, .info, .us, .ca, .de and others
05:21 ๐Ÿ”— arrith1 omf_: gj
05:26 ๐Ÿ”— namespace Ugh, that moment when you've been grabbing for like a day and realize all your grabs are contentless.
05:27 ๐Ÿ”— arrith1 namespace: should really be a tool for that
05:28 ๐Ÿ”— arrith1 could use loose ml, something like 'these are good files, if later files differ significantly, notify me somehow'
05:30 ๐Ÿ”— namespace Yeah.
05:30 ๐Ÿ”— namespace We need to code one up.
05:30 ๐Ÿ”— namespace That was really frustrating.
05:31 ๐Ÿ”— namespace I'm just glad I checked before I'd grabbed everything.
05:31 ๐Ÿ”— namespace That would have been such a waste.
05:31 ๐Ÿ”— arrith1 maybe even have an option to be somewhat specific like "notify me if 2% differs, 5%, etc'
05:31 ๐Ÿ”— namespace Eh.
05:31 ๐Ÿ”— namespace You could make it even simpler.
05:31 ๐Ÿ”— arrith1 namespace: i'm actually wondering that now. Google Reader hasn't really ratelimited and is honest about http codes but it's too much to check manually in time
05:31 ๐Ÿ”— namespace If the files start coming out sub a certain amount of memory let me now.
05:32 ๐Ÿ”— namespace *know
05:32 ๐Ÿ”— arrith1 namespace: oh like smaller than some byte size/count
05:32 ๐Ÿ”— namespace Like 9KB files probably aren't right.
05:32 ๐Ÿ”— arrith1 "less than 4 KB, 2 KB"
05:32 ๐Ÿ”— arrith1 ah, yeah, less than 1 MB or 0.5 MB even sometimes
05:33 ๐Ÿ”— namespace Mine are less than a megabyte but probably more than ten kilobytes.
05:33 ๐Ÿ”— arrith1 that wouldn't even take ml, just a periodic thing and a notification system
05:33 ๐Ÿ”— namespace Yeah, that's what I'm saying.
05:33 ๐Ÿ”— namespace It could be a bash script.
05:34 ๐Ÿ”— arrith1 how to notify though is what i'm wondering. since people run things headless a lot
05:34 ๐Ÿ”— namespace Though python or some such would be better for the sake of not having to deal with bash.
05:34 ๐Ÿ”— namespace Email?
05:34 ๐Ÿ”— namespace System beep?
05:34 ๐Ÿ”— arrith1 bash gets the job done :P
05:34 ๐Ÿ”— arrith1 maybe writing out an error message to a file and "touch STOP" ing
05:37 ๐Ÿ”— arrith1 or killing the process, if the process doesn't support "touch STOP"
05:38 ๐Ÿ”— arrith1 could have an analysis thing where if you have a few known good files, or a directory of known good files, scans them and suggests rounded values to use
05:39 ๐Ÿ”— arrith1 should use inotify on linux
05:54 ๐Ÿ”— arrith1 alright on my TODO list to write that in python. but no one feel like they shouldn't write one if they want to
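(A rough bash version of the check being discussed here; arrith1 planned a Python one. The directory, size threshold, count limit and the "touch STOP" convention are assumptions, not an existing tool.

    #!/bin/bash
    # Flag a grab whose recently written files are suspiciously small.
    WATCH_DIR="$1"              # directory the grab is writing into
    MIN_BYTES="${2:-10240}"     # files under this size are suspicious (10 KB)
    MAX_SMALL="${3:-20}"        # how many small files before we flag the grab

    while sleep 60; do
        small=$(find "$WATCH_DIR" -type f -mmin -10 -size -"${MIN_BYTES}"c | wc -l)
        if [ "$small" -gt "$MAX_SMALL" ]; then
            echo "$(date): $small files under $MIN_BYTES bytes in the last 10 min" >> grab-warning.log
            touch "$WATCH_DIR/STOP"   # the 'touch STOP' convention mentioned above
            break
        fi
    done
)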
08:14 ๐Ÿ”— namespace Ugh. Page grabs work when I use wget, but a full grab gets me contentless crap.
08:14 ๐Ÿ”— namespace *use wget on a single page
08:16 ๐Ÿ”— arrith1 namespace: some kind of ratelimiting?
08:16 ๐Ÿ”— namespace arrith1: Maybe.
08:16 ๐Ÿ”— namespace It's a vbulletin forum.
08:17 ๐Ÿ”— namespace I'm waiting ten seconds between items.
08:17 ๐Ÿ”— arrith1 well test out the URLs in the browser, also could be sending diff stuff depending on user agent, could try the fx addon user agent switcher
08:17 ๐Ÿ”— namespace Yeah, I was gonna try to use a different user agent.
08:17 ๐Ÿ”— arrith1 i know lots of forums send different data to user agent claiming to be crawlers, like will require registration unless the ua is googlebot
08:18 ๐Ÿ”— namespace I am registered.
08:18 ๐Ÿ”— namespace And a long-time member, for that matter.
08:18 ๐Ÿ”— arrith1 namespace: is your wget supplying cookies?
08:18 ๐Ÿ”— namespace arrith1: Yup.
08:19 ๐Ÿ”— namespace And yes they're still valid.
08:19 ๐Ÿ”— arrith1 welll, could just be a poorly coded user-agent hack thing. i'd use your browser UA and/or try fx user agent addon switcher
08:19 ๐Ÿ”— arrith1 namespace: changed your browser ua?
08:19 ๐Ÿ”— namespace Hmm.
08:19 ๐Ÿ”— namespace I could try changing my browser UA to test.
08:19 ๐Ÿ”— ivan` what exactly is the contentless crap?
08:20 ๐Ÿ”— SilSte2 Hi @ all
08:20 ๐Ÿ”— SilSte2 Tunewiki is closing...
08:20 ๐Ÿ”— SilSte2 http://www.tunewiki.com/news/186/tunewiki-is-shutting-down
08:22 ๐Ÿ”— namespace ivan`: A message you get when you look at the sites index.html
08:23 ๐Ÿ”— SilSte and is there a problem with xanga? Stopped getting new items...
08:24 ๐Ÿ”— arrith1 SilSte: might be. a few users are reporting issues. xanga discussion in #jenga
08:24 ๐Ÿ”— arrith1 SilSte: could make a page on the ArchiveTeam wiki for tunewiki if you want
08:24 ๐Ÿ”— SilSte I'm unsure if it's important enough ^^
08:24 ๐Ÿ”— arrith1 namespace: hm so sometimes you're able to get pages from wget, but not successively
08:24 ๐Ÿ”— SilSte And there are only 4 days left...
08:25 ๐Ÿ”— arrith1 SilSte: only thing that makes a site important enough is that people want to save it
08:25 ๐Ÿ”— arrith1 hmm
08:25 ๐Ÿ”— namespace ^ THis
08:26 ๐Ÿ”— namespace I think my options were:
08:27 ๐Ÿ”— namespace ./wget --warc-file --no-parent --mirror -w 10 --limit-rate 56k --verbose --load-cookies
08:27 ๐Ÿ”— namespace (URL's redacted for privacy reasons.)
08:30 ๐Ÿ”— arrith1 namespace: -erobots=off, also if you don't specify a ua then it does wget. i'd put money on vbulletin shipping with something to handle wget UAs
08:30 ๐Ÿ”— namespace The robots.txt has nothing in it, basically.
08:31 ๐Ÿ”— namespace Except for a 2 minute crawl limit.
08:31 ๐Ÿ”— namespace Or something like that.
08:31 ๐Ÿ”— SilSte wikisecretword?
08:31 ๐Ÿ”— arrith1 namespace: could set your delay time to 2min and set your UA to a normal firefox/chrome UA
08:31 ๐Ÿ”— arrith1 SilSte: yahoosucks
08:31 ๐Ÿ”— arrith1 SilSte: you didn't say the line btw :P
08:31 ๐Ÿ”— SilSte thx
08:31 ๐Ÿ”— namespace arrith1: I'll try that.
08:32 ๐Ÿ”— namespace How do I get a firefox UA?
08:32 ๐Ÿ”— arrith1 google for "what is my UA" and pages will show you
08:32 ๐Ÿ”— namespace I know how to switch it of course.
08:32 ๐Ÿ”— namespace Ah, okay.
08:32 ๐Ÿ”— arrith1 so you can use your actual browser UA
08:33 ๐Ÿ”— arrith1 also like livehttpheaders fx plugin has that info, wireshark would also show it. might be some about:config thing that says it. also copying/pasting from firefox useragent switcher addon, but using your own is more stealthy if you visit a site a lot
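(Putting the pieces above together, a sketch of the kind of invocation being discussed; the URL, cookie file and the exact User-Agent string are placeholders:

    wget --warc-file=forum-grab \
         --mirror --no-parent \
         -e robots=off \
         --wait 10 \
         --load-cookies cookies.txt \
         --user-agent "Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0" \
         -o forum-grab.log \
         http://forum.example.com/
)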
08:36 ๐Ÿ”— namespace At 2 mins a page this should take a few months. :P
08:37 ๐Ÿ”— arrith1 namespace: hmm, well one upside is you probably would be totally safe from triggering any ratelimiting heh
08:37 ๐Ÿ”— namespace I'm more afraid that the connection would time out before I finish.
08:38 ๐Ÿ”— namespace Remember, WARC has no timestamping.
08:38 ๐Ÿ”— arrith1 namespace: might want to save wget logs then
08:38 ๐Ÿ”— SilSte namespace: arrith: what are you doing?
08:38 ๐Ÿ”— arrith1 namespace: btw you can do retries, if it times out then it'll retry
08:39 ๐Ÿ”— arrith1 SilSte: namespace is saving some forums he likes a lot, and i'm working on ways to get RSS/atom feed urls for the Google Reader effort. right now that means a crawler to get usernames from livejournal
08:40 ๐Ÿ”— SilSte kk
08:40 ๐Ÿ”— Coderjoe qw3rty+P3R50N4L
08:40 ๐Ÿ”— Coderjoe hoshit
08:41 ๐Ÿ”— SilSte I made a Wikisite
08:41 ๐Ÿ”— SilSte http://www.archiveteam.org/index.php?title=TuneWiki
08:42 ๐Ÿ”— SilSte I'm not familiar with building the warrior etc... just wanted to inform ^^
08:42 ๐Ÿ”— namespace arrith1: Yeah, two minutes isn't happening.
08:42 ๐Ÿ”— arrith1 haha
08:43 ๐Ÿ”— namespace Sadly it's not doable on a sub 1200 baud modem.
08:44 ๐Ÿ”— SilSte I think the downloading of tunewiki should be kind of easy... there are list with the artists ;-)
08:44 ๐Ÿ”— arrith1 namespace: well you could try without a limit, if all it cares about is the UA. if you aren't downloading it a bunch then one quick dl shouldn't be too noticeable
08:44 ๐Ÿ”— namespace This is my third attempt.
08:44 ๐Ÿ”— arrith1 SilSte: would you like to download tunewiki? with a single wget command you probably could grab it all
08:44 ๐Ÿ”— namespace I think it might be a little noticeable.
08:44 ๐Ÿ”— SilSte if you tell me how... i can try
08:45 ๐Ÿ”— arrith1 SilSte: do you have access to a terminal on a linux machine or VM?
08:45 ๐Ÿ”— SilSte i can install a ubuntu vm
08:45 ๐Ÿ”— namespace Okay.
08:45 ๐Ÿ”— namespace Here's the manual.
08:45 ๐Ÿ”— arrith1 SilSte: sure, any linux you're familiar with
08:45 ๐Ÿ”— namespace https://www.gnu.org/software/wget/manual/wget.html
08:46 ๐Ÿ”— namespace It's super boring, but it'll tell you everything you'd want to know.
08:46 ๐Ÿ”— namespace (Assuming you sort of know how a web server works.)
08:46 ๐Ÿ”— arrith1 namespace: if you do "--warc-file" and don't specify anything, does it make up its own name?
08:46 ๐Ÿ”— namespace arrith1: I don't think so.
08:46 ๐Ÿ”— namespace I never tried it.
08:47 ๐Ÿ”— namespace I just redacted the file name too, for the same reasons.
08:47 ๐Ÿ”— arrith1 i think that would be neat. just take whatever wget is calling the file and append "warc.gz"
08:47 ๐Ÿ”— arrith1 ah
08:47 ๐Ÿ”— arrith1 SilSte: make sure the virtual hard drive of the VM has enough space to save a big site
08:48 ๐Ÿ”— arrith1 SilSte: could do some 500 GB, since the VM won't take up all the space until it needs it
08:48 ๐Ÿ”— SilSte arrith1: is ubuntu good or would you prefer debian?
08:48 ๐Ÿ”— namespace Seems to be working.
08:48 ๐Ÿ”— arrith1 SilSte: whichever you're more familiar with
08:48 ๐Ÿ”— arrith1 namespace: what delay?
08:51 ๐Ÿ”— namespace 10 seconds
08:52 ๐Ÿ”— namespace I think the user agent change fixed it.
08:52 ๐Ÿ”— namespace I also turned off the rate limit.
08:52 ๐Ÿ”— arrith1 namespace: could try without the limit >:)
08:52 ๐Ÿ”— arrith1 heh
08:52 ๐Ÿ”— namespace Because bandwidth isn't the bottleneck.
08:52 ๐Ÿ”— arrith1 if the site doesn't care, and can spare the bw
08:52 ๐Ÿ”— arrith1 yeah
08:52 ๐Ÿ”— arrith1 that's always nice. google reader downloading has been like that
08:53 ๐Ÿ”— namespace The bottleneck is the wait time, which I have set to ten because it's not like the site is going anywhere.
08:58 ๐Ÿ”— arrith1 SilSte: http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
08:58 ๐Ÿ”— arrith1 SilSte: http://pad.archivingyoursh.it/p/wget-warc
09:04 ๐Ÿ”— arrith1 SilSte: need to compile wget 1.14 (apt-get install build-essential openssl libssl-dev; tar xvf wget.tgz; cd wget; ./configure --with-ssl=openssl; make) then use the binary <wget dir>/src/wget, and can use this as a template: wget --warc-file=tunewiki --no-parent --mirror -w 2 -U "Wget/1.14 gzip ArchiveTeam" https://www.tunewiki.com
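(The same steps spelled out; the download URL is the usual GNU mirror path for wget 1.14 and is an assumption here:

    sudo apt-get install build-essential openssl libssl-dev
    wget http://ftp.gnu.org/gnu/wget/wget-1.14.tar.gz
    tar xvf wget-1.14.tar.gz
    cd wget-1.14
    ./configure --with-ssl=openssl
    make
    ./src/wget --version | head -1   # the freshly built binary lives in src/
)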
09:04 ๐Ÿ”— SilSte arrith its still installing ;-)
09:05 ๐Ÿ”— SilSte okay
09:05 ๐Ÿ”— SilSte thx
09:05 ๐Ÿ”— arrith1 SilSte: alright. hardest part is compiling wget probably. you might get slowed/down blocked depending on how the site responds. if the site seems to be not limiting you, you can take off the "-w 2"
09:06 ๐Ÿ”— SilSte ok
09:06 ๐Ÿ”— SilSte how will I know that they are limiting?
09:07 ๐Ÿ”— arrith1 SilSte: if you have terminal questions you can ask in #ubuntu on the Freenode irc network
09:07 ๐Ÿ”— arrith1 SilSte: the command will probably error out. oh yeah, you might want to put in a retries thing: --tries=10
09:09 ๐Ÿ”— arrith1 SilSte: generally trial and error, you can ask in here or in #ubuntu or #debian on Freenode. one thing to keep in mind is if a binary isn't in your PATH then you need to specify the full path to it, so /home/user/wget_build/wget-1.14/src/wget --mirror <etc>
09:09 ๐Ÿ”— arrith1 SilSte: another channel to be in on this network is #archiveteam-bs
09:09 ๐Ÿ”— SilSte ok
09:11 ๐Ÿ”— Smiley that's where all the crap talk is.
09:13 ๐Ÿ”— namespace Yeah, it's working now.
09:25 ๐Ÿ”— SilSte arrith1: Wget is already installed in 1.14. Is it ok then?
09:25 ๐Ÿ”— namespace Yeah.
09:25 ๐Ÿ”— Smiley yah should be.
09:26 ๐Ÿ”— namespace If it supports --warc-file it's good.
09:26 ๐Ÿ”— SilSte ok
09:26 ๐Ÿ”— arrith1 SilSte: do "man wget" and look for warc
09:26 ๐Ÿ”— namespace That's the new option they added last year.
09:26 ๐Ÿ”— arrith1 SilSte: or wget --help
09:27 ๐Ÿ”— SilSte looks good
09:29 ๐Ÿ”— arrith1 SilSte: good to hear!, that's the benefit of installing a newer version of a distro i guess
09:29 ๐Ÿ”— SilSte installed 13.04 server
09:29 ๐Ÿ”— arrith1 SilSte: you should be good to go, just make a dir, cd in, and try some wget stuff. that one i said should work as-is, but you can add what you want
09:29 ๐Ÿ”— SilSte but it looks a little bit slow...
09:29 ๐Ÿ”— arrith1 SilSte: the download or the vm?
09:30 ๐Ÿ”— SilSte download
09:30 ๐Ÿ”— arrith1 SilSte: btw i hope your hdd is huge
09:30 ๐Ÿ”— SilSte it makes about one step per sec
09:30 ๐Ÿ”— arrith1 SilSte: you can remove the "-w 2" so it doesn't wait
09:30 ๐Ÿ”— SilSte i made a 600GB file
09:30 ๐Ÿ”— arrith1 good :)
09:31 ๐Ÿ”— SilSte will this be fast enough with that around 1 item per sec?
09:31 ๐Ÿ”— ivan` "Wget/1.14 gzip" is a special string for Google, wget does not actually support gzip
09:31 ๐Ÿ”— SilSte it made about 2MB right now...
09:32 ๐Ÿ”— arrith1 ivan`: ah, i meant to have the "ArchiveTeam" in there and copypasted heh
09:32 ๐Ÿ”— SilSte and is it possible to pause and restart wget? Or will it start over?
09:32 ๐Ÿ”— arrith1 SilSte: that's a thing that limits how fast wget goes
09:34 ๐Ÿ”— SilSte okay ... so i can try without... if I need to stop. Does it restart?
09:35 ๐Ÿ”— arrith1 SilSte: i don't know for sure, but i think it would be fine. to stop and resume, the stuff --mirror turns on should be fairly comprehensive
09:35 ๐Ÿ”— arrith1 SilSte: probably best to leave it on, and let it go as fast as the site will let it
09:35 ๐Ÿ”— arrith1 SilSte: ctrl-c to stop
09:36 ๐Ÿ”— SilSte arrith1: "probably best to leave it on, and let it go as fast as the site will let it" with or without -w 2? Atm its on
09:36 ๐Ÿ”— SilSte I'm getting some 404s... is that okay?
09:38 ๐Ÿ”— arrith1 SilSte: yeah. could try without the -w
09:41 ๐Ÿ”— SilSte now its running ^^
09:42 ๐Ÿ”— SilSte as long as its returning 200s everything should be fine I think
09:42 ๐Ÿ”— arrith1 SilSte: yep
09:42 ๐Ÿ”— SilSte if they blacklist me... it will just stop?
09:42 ๐Ÿ”— arrith1 SilSte: might want to keep the directory you use fairly clean, so for each attempt could make a new directory
09:43 ๐Ÿ”— arrith1 SilSte: that's one way to blacklist. another is to show data that isn't the real data from the site (garbage data), or slowing it down
09:43 ๐Ÿ”— arrith1 SilSte: periodically it would be good to check the data you're getting
09:43 ๐Ÿ”— SilSte how can i do this?
09:43 ๐Ÿ”— SilSte ahh
09:43 ๐Ÿ”— SilSte found the folder
09:43 ๐Ÿ”— arrith1 SilSte: you can setup a shared folder between your host and guest and copy files from the guest to the host to view them
09:44 ๐Ÿ”— arrith1 SilSte: yeah, something like ~/archiveteam/tunewiki/attempt1 then attempt2, attempt3, etc
09:44 ๐Ÿ”— SilSte why different attempts? ^^
09:45 ๐Ÿ”— arrith1 SilSte: sometimes you want to start over or start fresh
09:45 ๐Ÿ”— SilSte hmmm k
09:45 ๐Ÿ”— arrith1 maybe if one attempt was going wrong for some reason
09:45 ๐Ÿ”— SilSte i will run it now and watch back later ;-)
09:46 ๐Ÿ”— arrith1 good luck :)
09:48 ๐Ÿ”— SilSte I'm getting some "is not a directory"- errors
09:49 ๐Ÿ”— arrith1 after GR is down, spidering feedburner would be good. means i need to find solid crawling software. gnu parallel's example page has an interesting section on wget as a parallel crawler
09:49 ๐Ÿ”— arrith1 SilSte: hm odd, google errors you're curious about
09:49 ๐Ÿ”— SilSte kk
09:49 ๐Ÿ”— SilSte thx for your help <3
11:06 ๐Ÿ”— Nemo_bis Why not disallow all log/ dir now that it exists? https://catalogd.us.archive.org/robots.txt
11:23 ๐Ÿ”— ivan` SketchCow: thanks for that retweet
11:29 ๐Ÿ”— Nemo_bis I'm not sure if these are complete :S https://ia601803.us.archive.org/zipview.php?zip=/12/items/ftp-ftp.hp.com_ftp1/graham.zip https://ia601803.us.archive.org/zipview.php?zip=/12/items/ftp-ftp.hp.com_ftp1/catia.zip
12:46 ๐Ÿ”— Smiley Ok, anyone alive to assist with me trying to setup this EC2 stuff?
12:46 ๐Ÿ”— Smiley I can't even ssh in even though the rules seem ok D:
12:47 ๐Ÿ”— Smiley Oh ffs, fixed that.
12:58 ๐Ÿ”— Smiley Ok - next person who can help me getting this EC2 instance up and running please let me know. I have ubuntu server 13. something, wget 1.14, tornado, the seesaw stuff, all done.
13:45 ๐Ÿ”— SilSteStr test
13:45 ๐Ÿ”— SilSteStr okay
13:50 ๐Ÿ”— SilSteStr gnah... wget does not continue the warc file -.-
13:51 ๐Ÿ”— ivan` you can make a new warc and cat it onto the old one
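(What ivan` describes, concretely: run the second pass with a new WARC name, then append it to the first. Gzipped WARCs concatenate cleanly because wget writes each record as its own gzip member; the filenames here are placeholders.

    wget --warc-file=tunewiki-part2 --mirror -e robots=off https://www.tunewiki.com
    cat tunewiki-part2.warc.gz >> tunewiki.warc.gz
)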
13:52 ๐Ÿ”— SketchCow No problem.
13:52 ๐Ÿ”— SketchCow I'm packing up soon to drive south
13:53 ๐Ÿ”— PepsiMax Any sources about xanga.com dying?
13:53 ๐Ÿ”— PepsiMax I don't mind doing some CPU cycles/traffic for ArchiveTeam, but is it worth it? :-)
13:53 ๐Ÿ”— PepsiMax - if they dont dy :P
13:53 ๐Ÿ”— PepsiMax die
13:54 ๐Ÿ”— SilSteStr ivan`: how?
13:55 ๐Ÿ”— SilSteStr atm my command is "wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com
13:56 ๐Ÿ”— Smiley PepsiMax: either you do it, or you don't. We aren't here to convince you, but it's going.
13:56 ๐Ÿ”— Smiley If you wanted to find it, you simply need to visit the damn site itself.
13:57 ๐Ÿ”— SketchCow So, XANGA.COM's Xanga Team log posted this entry a while ago:
13:58 ๐Ÿ”— SketchCow http://thexangateam.xanga.com/773587240/relaunching-xanga-a-fundraiser/
13:58 ๐Ÿ”— SketchCow * May 30th: We launch this fundraiser, and continue our work building a WordPress version of Xanga.
13:59 ๐Ÿ”— SketchCow * Through July 15th: We will contact our registered members to let them know about the fundraiser, and also allow any and all users to download their blogs and media files for free.
13:59 ๐Ÿ”— SketchCow * July 15th: This will be the final day for the fundraiser.
13:59 ๐Ÿ”— SketchCow If we have a successful fundraiser:
13:59 ๐Ÿ”— SketchCow * July 15th: If we've raised $60k, then we will move over to the new WordPress version on this date.
13:59 ๐Ÿ”— SketchCow If the fundraiser isn't successful:
13:59 ๐Ÿ”— SketchCow * July 15th: If we haven't raised $60k, then this will be the last date that Xanga is up and running.
13:59 ๐Ÿ”— SketchCow ...
13:59 ๐Ÿ”— SketchCow So, that means that either 1. They're going to delete everything, or 2. They're going to utterly move everything to a new platform, which leads to lost formatting, items, and who knows what else.
14:05 ๐Ÿ”— SketchCow http://thexangateam.xanga.com/774035061/update-on-relaunch-xanga-fundraiser-and-xanga-archives-news/ is a follow up post. They indicate how to download the blogs if you want to, but again they are not clear of how the blogs would change.
14:23 ๐Ÿ”— winr4r yeah, and even if it's 2), they're going to a paid-account model, which means that all the old free users would probably lose all of their shit
14:23 ๐Ÿ”— winr4r well, anyone that didn't sign up for the new paid service would lose their shit
14:24 ๐Ÿ”— winr4r they're a bit unclear on that count, but that's what i've extrapolated from what they have *not* said
14:28 ๐Ÿ”— SketchCow Right. This is all a mountain of uncertainty, making a backup worth doing.
14:28 ๐Ÿ”— SketchCow And now PepsiMax got the learns
14:29 ๐Ÿ”— SilSteStr if I need to stop my wget... what shall i do to prevent a total restart?
14:30 ๐Ÿ”— winr4r is --no-clobber what you are looking for?
14:30 ๐Ÿ”— Smiley eheeerrrghh what?
14:30 ๐Ÿ”— Smiley ec2-bundle-vol has created loaaaaaaads of files
14:31 ๐Ÿ”— Smiley image.part.{00..57}
14:31 ๐Ÿ”— Smiley and a 10Gb image file too
14:32 ๐Ÿ”— SilSteStr winr4r: don't know... but wget stopped downloading... I stopped it, started it again and now the WARC file begins from the beginning....
14:32 ๐Ÿ”— PepsiMax SketchCow: http://pphilip.xanga.com/774075894/your-blog-is-not-useless/
14:32 ๐Ÿ”— PepsiMax sounds a bit too good to be true :P
14:36 ๐Ÿ”— Smiley errrr
14:36 ๐Ÿ”— Smiley what are you trying to prove PepsiMax ?
14:38 ๐Ÿ”— Smiley "The unfortunate thing about the xanga archives is that the html is hardcoded to link to images on the xanga servers - which will no longer be there. So you will have the text of your blogs - and comments - but you will not easily be able to find what pictures go with each blog entry after the xanga servers go down."
14:38 ๐Ÿ”— Smiley Fail.
14:41 ๐Ÿ”— winr4r SilSteStr: the --continue/-c option might do what you want, don't know how that plays with WARC though
14:44 ๐Ÿ”— SilSteStr winr4r: its not working with WARC...
14:44 ๐Ÿ”— winr4r :<
14:45 ๐Ÿ”— omf_ continue does not work with warc
14:46 ๐Ÿ”— SilSteStr so do I really have to start over each time? Oo
14:47 ๐Ÿ”— Smiley hmmmm kind of
14:47 ๐Ÿ”— Smiley you should log the finished urls and then exclude them.
14:47 ๐Ÿ”— Smiley :D
14:48 ๐Ÿ”— SilSteStr Oo
14:48 ๐Ÿ”— SilSteStr I'm not really familiar...
14:50 ๐Ÿ”— Smiley me nither, but it'd work I think :D
14:51 ๐Ÿ”— * omf_ poke ivan`
14:51 ๐Ÿ”— SilSteStr lol
14:52 ๐Ÿ”— omf_ the problem with Smiley's idea is that wget limits how many urls you can skip, because wget is junk
14:52 ๐Ÿ”— Smiley Doh!
14:56 ๐Ÿ”— Aranje omf_:) sorry, power outage. dunno if sue said anything, I asked him to.
14:56 ๐Ÿ”— omf_ I got the message A
14:57 ๐Ÿ”— Aranje kk. <3 sue
14:57 ๐Ÿ”— omf_ no worries
14:57 ๐Ÿ”— omf_ I am currently sucking down 150 million domain names
14:58 ๐Ÿ”— ivan` sup
14:59 ๐Ÿ”— omf_ I got 150 million unique domain names ivan
14:59 ๐Ÿ”— ivan` great
14:59 ๐Ÿ”— ivan` can I grab them yet?
14:59 ๐Ÿ”— omf_ I am still downloading the lists
14:59 ๐Ÿ”— ivan` alright
15:00 ๐Ÿ”— omf_ They are broken into blocks of 5000 for easy management
15:00 ๐Ÿ”— omf_ I also got all the urls from dmoz and 350,000 from a domain sale site
15:02 ๐Ÿ”— ivan` for ameblo.jp,blog.livedoor.jp,feeds.feedburner.com,feeds2.feedburner.com,feeds.rapidfeeds.com,blog.roodo.com it would be super-great if you could get the thing after the first slash
15:02 ๐Ÿ”— ivan` groups.yahoo.com/group/,groups.google.com/group/,www.wretch.cc/blog/ second slash
15:03 ๐Ÿ”— ivan` youtube.com/user/ second slash but I don't know if I'll get to those, seems kind of low value anyway
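(A sketch of the extraction ivan` is asking for, assuming a urls.txt with one URL per line; the file names are hypothetical. First path segment for the feed/blog hosts, second for the group hosts:

    # first path segment, e.g. http://ameblo.jp/<name>/...
    grep -E '^https?://(ameblo\.jp|blog\.livedoor\.jp|feeds2?\.feedburner\.com|feeds\.rapidfeeds\.com|blog\.roodo\.com)/' urls.txt \
        | awk -F/ '{print $4}' | sort -u > first-segment.txt
    # second path segment, e.g. http://groups.google.com/group/<name>
    grep -E '^https?://(groups\.yahoo\.com|groups\.google\.com|www\.wretch\.cc)/' urls.txt \
        | awk -F/ '{print $5}' | sort -u > second-segment.txt
)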
15:03 ๐Ÿ”— SilSteStr omf_: what are you doing?
15:03 ๐Ÿ”— omf_ collecting domain names which I plan to release as a data set on IA
15:03 ๐Ÿ”— SilSteStr omf_: So what shall I use instead?
15:04 ๐Ÿ”— ivan` thanks omf_
15:04 ๐Ÿ”— SilSteStr omf_: kk
15:04 ๐Ÿ”— omf_ The normal big lists are only 1 million domains total and there are only 2 of those lists public
15:05 ๐Ÿ”— omf_ basically someone could seriously start a search engine using this list
15:05 ๐Ÿ”— Aranje omf_:) http://pastebin.com/5y0aemPs
15:06 ๐Ÿ”— Aranje primary assumption: installed on each node, not centrally
15:07 ๐Ÿ”— omf_ that is correct Aranje
15:07 ๐Ÿ”— Aranje wonderful :)
15:07 ๐Ÿ”— * Aranje fixes local config based on changes
15:07 ๐Ÿ”— * Aranje grins
15:08 ๐Ÿ”— Smiley http://www.governmentattic.org/8docs/NSA-WasntAllMagic_2002.pdf
15:08 ๐Ÿ”— Smiley http://www.governmentattic.org/8docs/NSA-TrafficAnalysisMonograph_1993.pdf
15:08 ๐Ÿ”— Smiley someone go get em and submit to IA plz
15:09 ๐Ÿ”— * Smiley won't as he's going to get his train now.
15:09 ๐Ÿ”— omf_ got em
15:10 ๐Ÿ”— omf_ site is down again
15:10 ๐Ÿ”— omf_ too much HN/reddit traffic
15:13 ๐Ÿ”— ivan` my 22GB/2.4 billion commoncrawl set http://204.12.192.194:32047/common_crawl_index_urls.bz2 will be up for another week, I do not really know how/where to upload to IA
15:13 ๐Ÿ”— omf_ I can take care of that for you ivan` if you want.
15:14 ๐Ÿ”— GLaDOS I
15:14 ๐Ÿ”— GLaDOS 'll start a fetch for it onto anarchive
15:14 ๐Ÿ”— omf_ good idea GLaDOS
15:15 ๐Ÿ”— omf_ ivan`, how should I get this csv of domains to you?
15:16 ๐Ÿ”— ivan` omf_: a torrent would be most convenient but just about anything will work
15:16 ๐Ÿ”— ivan` how big is it?
15:16 ๐Ÿ”— omf_ this is just the 335k list
15:17 ๐Ÿ”— omf_ 12mb uncompressed
15:17 ๐Ÿ”— ivan` if it's <1GB http://allyourfeed.ludios.org:8080/
15:17 ๐Ÿ”— ivan` heh
15:18 ๐Ÿ”— omf_ done
15:19 ๐Ÿ”— ivan` got it, thanks
15:20 ๐Ÿ”— Aranje I like this config better than the one I was using >_>
15:33 ๐Ÿ”— SilSteStr ivan`: are there things someone can help with for Google?
15:39 ๐Ÿ”— ivan` SilSteStr: yes, we really need good query lists that will find more feeds using Reader's Feed Directory
15:39 ๐Ÿ”— ivan` n-grams, obscure topics, words in every language, etc
15:40 ๐Ÿ”— ivan` some of the sites listed on http://www.archiveteam.org/index.php?title=Google_Reader need to be spidered to find more users
15:40 ๐Ÿ”— ivan` I can put up a list of every query that's been imported into greader-directory-grab for inspiration
15:41 ๐Ÿ”— SilSteStr so i should run the "greader directory grab"?
15:41 ๐Ÿ”— ivan` sure
15:42 ๐Ÿ”— ivan` it does not do the tedious work of finding things to search for, however ;)
15:44 ๐Ÿ”— SilSteStr how may i help there?
15:46 ๐Ÿ”— ivan` you can google for big lists of things, see also wikipedia's many lists, and make clean lists of queries
15:46 ๐Ÿ”— ivan` the queries get plugged into https://www.google.com/reader/view/#directory-page/1 - you can see if you get good results
15:47 ๐Ÿ”— SilSteStr ./o\ I need a google acc then :D
15:47 ๐Ÿ”— ivan` we also need 2-grams for all the languages, that is, word pairs
15:47 ๐Ÿ”— ivan` indeed
15:47 ๐Ÿ”— SilSteStr wanted me to log in ^^
15:47 ๐Ÿ”— SilSteStr lets continue in the other channel ^^
15:47 ๐Ÿ”— ivan` yep
16:46 ๐Ÿ”— SilSteStr I'm still getting those "is not a directory" failures with tunewiki :(. It's also telling me: "Cannot write to XY" (Success) ...
16:47 ๐Ÿ”— SilSteStr http://snag.gy/EHokz.jpg (german sry)
16:47 ๐Ÿ”— SilSteStr any ideas?
16:47 ๐Ÿ”— SilSteStr wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com
16:47 ๐Ÿ”— SilSteStr is the command i used
20:00 ๐Ÿ”— namespace SilSteStr: Why'd you turn off robots.txt?
20:00 ๐Ÿ”— namespace Is their robots.txt stupid or?
20:00 ๐Ÿ”— SilSteStr read this somewhere :D
20:01 ๐Ÿ”— namespace It could also be the User Agent.
20:02 ๐Ÿ”— namespace Oh I think I get it.
20:02 ๐Ÿ”— namespace You need to turn off --no-parent
20:02 ๐Ÿ”— namespace Tunewiki here would be the root directory.
20:02 ๐Ÿ”— namespace So there's no point in having it on and it might be messing it up.
20:02 ๐Ÿ”— SilSteStr hmmm
20:03 ๐Ÿ”— SilSteStr so... should i delete everything?
20:03 ๐Ÿ”— namespace Wait, is it a wikia forum?
20:03 ๐Ÿ”— namespace No, it's not.
20:03 ๐Ÿ”— namespace Does "everything" have no data?
20:04 ๐Ÿ”— namespace You can check with a browser.
20:05 ๐Ÿ”— SilSteStr no
20:05 ๐Ÿ”— SilSteStr only some....
20:05 ๐Ÿ”— namespace I wouldn't delete it if it's got data.
20:05 ๐Ÿ”— SilSteStr and i'm not sure if there is no data...
20:05 ๐Ÿ”— namespace I just said check with a browser.
20:05 ๐Ÿ”— SilSteStr it just tells at some points that there is a "is not a directory" failure...
20:06 ๐Ÿ”— namespace ...
20:06 ๐Ÿ”— namespace But it keeps grabbing?
20:06 ๐Ÿ”— SilSteStr yes
20:06 ๐Ÿ”— SilSteStr i googled this...
20:06 ๐Ÿ”— SilSteStr one sec
20:07 ๐Ÿ”— SilSteStr i found this
20:07 ๐Ÿ”— SilSteStr http://superuser.com/questions/266112/mirroring-a-wordpress-site-with-wget
20:07 ๐Ÿ”— SilSteStr but I don't know how to fix it...
20:07 ๐Ÿ”— Tephra SilSteStr: I had that problem when there was a file named foo and then a directory named foo and wget tried to download to foo/bar, it couldn't create the directory foo since the file foo existed
20:07 ๐Ÿ”— Tephra if that makes sense
20:08 ๐Ÿ”— SilSteStr Tephra: I think this is the problem...
20:08 ๐Ÿ”— namespace Man, archive teams combined knowledge could be used for some serious patches to wget.
20:09 ๐Ÿ”— Tephra SilSteStr: can you try with --no-clobber ?
20:09 ๐Ÿ”— namespace Not being able to resolve a file name duplication is fail.
20:09 ๐Ÿ”— SilSteStr ok... I will try it in another folder...
20:10 ๐Ÿ”— SilSteStr I'm already at h... ;-)
20:10 ๐Ÿ”— Tephra namespace: yes and a serious pain in the ass, when you have been grabbing something for 1 h then start seeing that message
20:11 ๐Ÿ”— SilSteStr I'm also not really sure if everything is fine... after about 6 hours the warc file has only 150MB...
20:13 ๐Ÿ”— SilSteStr I get the failure
20:13 ๐Ÿ”— SilSteStr "Timestamp" and "Overwriting old files" is at the same time not impossible
20:14 ๐Ÿ”— SilSteStr (in German ^^, -> translated)
20:14 ๐Ÿ”— Tephra do you mean possible?
20:17 ๐Ÿ”— Tephra SilSteStr: try --force-directories maybe?
20:17 ๐Ÿ”— SilSteStr uhh
20:17 ๐Ÿ”— SilSteStr yes
20:17 ๐Ÿ”— SilSteStr okay
20:17 ๐Ÿ”— SilSteStr its running now
20:18 ๐Ÿ”— SilSteStr same problem again
20:18 ๐Ÿ”— Tephra seems like there's a bug filed: http://savannah.gnu.org/bugs/?29647
20:19 ๐Ÿ”— SilSteStr in 2010 Oo
20:19 ๐Ÿ”— SilSteStr "www.tunewiki.com/lyrics/rihanna: Is not a directory www.tunewiki.com/lyrics/rihanna/diamons: Is not a directory"
20:20 ๐Ÿ”— SilSteStr "Cannot write to "www.tunewiki.com/lyrics/rihanna(diamons" (success).
20:20 ๐Ÿ”— Tephra wget moves slow
20:21 ๐Ÿ”— Tephra seems like a patch was made in 2012
20:22 ๐Ÿ”— SilSteStr but will it work with 1.14? ^^
20:23 ๐Ÿ”— Tephra dunno
20:25 ๐Ÿ”— SilSteStr I'll make a first run... afterwards there's hopefully time for tweaking...
20:29 ๐Ÿ”— Tephra maybe we should file a bug report and hopefully it gets fixed
20:31 ๐Ÿ”— Tephra SilSteStr: could you send me the complete command and url that you are trying?
20:32 ๐Ÿ”— SilSteStr wget --warc-file=tunewiki --no-parent --mirror -U "Wget/1.14 gzip Archiveteam" -mbc -e robots=off https://www.tunewiki.com
20:32 ๐Ÿ”— Tephra thanks!
20:33 ๐Ÿ”— RichardG I'm uploading Google Answers while IA doesn't move it to ArchiveTeam - right now doing batch 390 of 787
20:33 ๐Ÿ”— ivan` wget does not support gzip don't put that in the user agent
20:35 ๐Ÿ”— SilSteStr kk
20:36 ๐Ÿ”— SilSteStr but doesnt change anything...
20:37 ๐Ÿ”— ivan` I know, it's just a moral hazard to leave it in there
20:37 ๐Ÿ”— SilSteStr kk ^^
20:38 ๐Ÿ”— ivan` one day I'll want some gzip data and that sucky user agent has spread all over the internet
20:38 ๐Ÿ”— SilSteStr if I send a wget to the background... is there a possibility to get it back to the foreground?
20:38 ๐Ÿ”— ivan` SilSteStr: you can start it in screen, detach, attach
20:38 ๐Ÿ”— ivan` or tmux if you like that
20:39 ๐Ÿ”— SilSteStr and if its already started? ^^
20:39 ๐Ÿ”— SilSteStr chose to log to a file
20:39 ๐Ÿ”— ivan` fg, maybe
20:40 ๐Ÿ”— SilSteStr not working...
20:40 ๐Ÿ”— SilSteStr looks like russian roulette then ;-)
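(The screen workflow ivan` suggests, for the next run; an already backgrounded job is harder to reclaim, as noted. The session name is arbitrary:

    screen -S tunewiki      # start a named session and run wget inside it
    # detach with Ctrl-a d; the job keeps running
    screen -ls              # list running sessions
    screen -r tunewiki      # reattach later
)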
20:43 ๐Ÿ”— Tephra hmm can't get it to work, looks like a genuine bug to me
20:44 ๐Ÿ”— SilSteStr kk
21:27 ๐Ÿ”— arrith1 SilSteStr: should google: how to cat files
21:27 ๐Ÿ”— arrith1 SilSteStr: that's how you cat warcs together
21:28 ๐Ÿ”— SilSteStr this works with warcs?
21:29 ๐Ÿ”— SilSteStr what about double files?
21:29 ๐Ÿ”— arrith1 SilSteStr: like wget when a thing is a directory and file with the same name?
21:31 ๐Ÿ”— SilSteStr ?
21:40 ๐Ÿ”— arkhive How do I know if the Wayback Machine grabbed all of a site? www.xbdev.net/index.php
21:40 ๐Ÿ”— arkhive Is there a way to compare it automatically?
21:41 ๐Ÿ”— arrith1 SilSteStr: what do you mean by "double files"?
21:42 ๐Ÿ”— arrith1 arkhive: "if you have a video list in a file you can use ia-dirdiff to check the items on IA"
21:42 ๐Ÿ”— arrith1 arrith1: that was said earlier in #archiveteam-bs, so ia-dirdiff might do what you want
21:43 ๐Ÿ”— arrith1 arrith1: one way to know for sure is to wget it yourself then upload it to IA with warcs ;)
21:59 ๐Ÿ”— arrith1 er
21:59 ๐Ÿ”— arrith1 arkhive:
21:59 ๐Ÿ”— arrith1 no idea what's going on
22:01 ๐Ÿ”— arkhive arrith1: I can't grab it myself atm
22:02 ๐Ÿ”— arrith1 acknowledged
22:03 ๐Ÿ”— omf_ You could also check the CDX search for urls
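(One way to do what omf_ suggests, via the Wayback Machine's public CDX endpoint; the exact query here, a domain match deduplicated by URL and capped at 50 rows, is just an example:

    curl 'http://web.archive.org/cdx/search/cdx?url=xbdev.net&matchType=domain&collapse=urlkey&limit=50'
)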
22:10 ๐Ÿ”— SilSteStr did you think about a raspberry pi warrior?
22:16 ๐Ÿ”— omf_ I have a raspberry pi warrior
22:16 ๐Ÿ”— omf_ I made it a few months ago
22:17 ๐Ÿ”— SilSteStr does it work good?
22:18 ๐Ÿ”— SilSteStr I would like something autonomous to spread to family ;-)
22:18 ๐Ÿ”— SilSteStr already did this with tor
22:18 ๐Ÿ”— SilSteStr small little boxes, safe config
22:19 ๐Ÿ”— SilSteStr should work for warriors as well ;-)
22:20 ๐Ÿ”— arrith1 probably would be fairly CPU-bound, but possibly optimize-able. at least Raspbian is sort of the default, shouldn't be too hard to attempt porting of the warrior stuff from the vm
22:20 ๐Ÿ”— omf_ it is not cpu bound it is RAM limited
22:20 ๐Ÿ”— omf_ wget is a filthy pig
22:21 ๐Ÿ”— arrith1 omf_: max ram i've heard of on the raspi is 512 MB, what sort of usage does the warrior's wget get to?
22:21 ๐Ÿ”— omf_ My plans are to do a few week test run doing url shorteners to see how it works out
22:21 ๐Ÿ”— omf_ wget uses more ram the more urls it collects
22:22 ๐Ÿ”— omf_ so the bigger the site, the more ram
22:23 ๐Ÿ”— arrith1 well, could try to do some heuristics to keep the urls the raspi warrior loads up at once within RAM limits
22:24 ๐Ÿ”— omf_ which would require changing the mess of shit code known as wget
22:24 ๐Ÿ”— omf_ I am testing out a warc convertor for httrack since it already is smart about managing ram and concurrent connections
22:25 ๐Ÿ”— arrith1 hmm maybe. i think wget-lua is fairly powerful though.
22:25 ๐Ÿ”— omf_ In terms of web scrapers wget is the drooling retard in the corner
22:25 ๐Ÿ”— xmc heh, definitely
22:25 ๐Ÿ”— omf_ wget-lua is a hack to work around wget's design flaws
22:26 ๐Ÿ”— omf_ and it uses more ram since it has to run lua
22:29 ๐Ÿ”— omf_ we use wget because it has warc support built in. I am working on warc support for better applications
22:29 ๐Ÿ”— dashcloud Nemo_bis: HP carries the service manuals for all their products on the product page- if you can easily extract those, it would be a great addition to the service manuals collection
22:29 ๐Ÿ”— omf_ just like someone took the time to build warc into wget. Evolution based on available developer time
22:34 ๐Ÿ”— Nemo_bis where someone is dear al.ard
22:53 ๐Ÿ”— namespace omf_: What program are you putting WARC into?
22:53 ๐Ÿ”— arrith1 i think httrack
22:54 ๐Ÿ”— arrith1 which interestingly is GPLv3. don't really see much GPLv3
