#archiveteam-bs 2013-06-24,Mon


00:14 🔗 omf_ ivan`, you just need rss feeds
00:14 🔗 omf_ or domain names
00:16 🔗 ivan` omf_: I need feeds, but I can infer the feed URLs based on subdomains or a /path or /user/path
00:17 🔗 ivan` there are also some 'obvious' feeds like /rss.xml /atom.xml that would be nice to get
00:17 🔗 ivan` I can make a pattern for those too
00:17 🔗 ivan` the domains will be the sites listed in http://www.archiveteam.org/index.php?title=Google_Reader and more, if I go look for them tomorrow
00:17 🔗 arrith1 omf_: for example one thing that can be in a url is a username which can be used to infer feeds
00:20 🔗 ivan` will domain prefixes + url filtering script work for you?
00:21 🔗 ivan` or, I could skip the domain prefixes and have the script filter all 160 billion URLs
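The feed-pattern filtering ivan` describes could be sketched as a grep over a URL list (the patterns and function name here are illustrative guesses at the 'obvious' feed paths mentioned above, not the actual script):

```shell
# Filter a URL list down to likely feed URLs.
# The patterns are illustrative; a real run would pipe in the full list.
filter_feeds() {
  grep -E '(/rss\.xml|/atom\.xml|/feed/?|/rss/?)$'
}

printf '%s\n' \
  'http://example.com/rss.xml' \
  'http://example.com/about.html' \
  'http://blog.example.org/atom.xml' | filter_feeds
```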
00:26 🔗 omf_ ivan`, did posterous have rss feeds?
00:30 🔗 ivan` yes
00:30 🔗 ivan` I've grabbed them, but almost all were 404 in reader's cache
00:30 🔗 omf_ I assume you loaded that 9.8 million url list
00:30 🔗 ivan` yes
00:30 🔗 omf_ what about the live journal list we got
00:30 🔗 ivan` IA list has been imported, #donereading is working on an lj crawler
00:31 🔗 ivan` is there another lj list?
00:31 🔗 omf_ xanga users
00:31 🔗 omf_ not that I know of
00:31 🔗 ivan` by IA list I meant wayback
00:31 🔗 ivan` xanga has been imported
00:31 🔗 ivan` hm, not the new discoveries though
00:34 🔗 arrith1 also #donereading on a blogspot crawler
00:35 🔗 omf_ what about reddit subreddits
00:36 🔗 ivan` done
00:36 🔗 joepie91 anyone that natively speaks something that isn't Dutch or English, want to help out with the VPS panel I'm working on?
00:37 🔗 joepie91 https://www.transifex.com/projects/p/cvm
00:37 🔗 joepie91 (if a language is missing, tell me and I'll add it :P)
00:38 🔗 omf_ I assume you also loaded in the alexa and quantcast url lists
00:38 🔗 ivan` I did not know those existed
00:39 🔗 ivan` is there a dump or do I have to query something?
00:39 🔗 arrith1 omf_: yeah any url list ideas you have would be greatly appreciated
00:40 🔗 omf_ here let me link you
00:40 🔗 omf_ top 1 million sites http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
00:42 🔗 omf_ and here is the quantcast top 1 million sites https://ak.quantcast.com/quantcast-top-million.zip
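The Alexa dump linked above unzips to a CSV of "rank,domain" lines, so pulling out just the domains for seeding is a one-liner (filenames are assumptions; the Quantcast dump may use a different column layout):

```shell
# Extract the domain column from a "rank,domain" CSV such as the
# Alexa top-1m list, e.g. after:
#   curl -O http://s3.amazonaws.com/alexa-static/top-1m.csv.zip && unzip top-1m.csv.zip
extract_domains() {
  cut -d, -f2 "$1"
}
```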
00:42 🔗 ivan` thanks
00:42 🔗 ivan` so, will you be able to run through the cuil data?
00:42 🔗 omf_ it will take time
00:42 🔗 ivan` okay
00:43 🔗 omf_ what about the url lists from the url shorteners
00:44 🔗 ivan` good idea
00:44 🔗 omf_ http://urlte.am/
00:44 🔗 omf_ what did you get out of common crawl
00:45 🔗 ivan` 2.4 billion URLs, a lot of stuff imported from there
00:45 🔗 ivan` I have a 22GB bz2 of their URLs
00:47 🔗 omf_ that seems small; the common crawl url index with no content is 200gb compressed
00:48 🔗 ivan` https://github.com/trivio/common_crawl_index claims 5 billion URLs
00:49 🔗 ivan` I got 2.4 billion when running the included Python program
00:49 🔗 ivan` 5 billion URLs compressed can't take up 200GB
00:52 🔗 ivan` urlteam torrent has 1 seed uploading at 1KB/s
00:53 🔗 omf_ I think that is all on IA as well
00:54 🔗 ivan` I see only 2011 dumps on IA
00:59 🔗 ivan` I will bbl, gotta sleep
01:10 🔗 omf_ ivan`, I got a few smaller but interesting lists
01:10 🔗 omf_ universities - http://doors.stanford.edu/universities.html
01:12 🔗 omf_ http://catalog.data.gov/dataset/us-department-of-education-ed-internet-domains
04:40 🔗 Coderjoe wow
04:40 🔗 Coderjoe http://www.forensicswiki.org/wiki/Main_Page
04:41 🔗 Coderjoe just stumbled on this while looking for information on disk image formats
04:43 🔗 omf_ Coderjoe, that is a good site
04:52 🔗 Coderjoe is there a simple fuse program that can take a raw disk image and export the partitions? (I've tried guestfs and vdfuse and both do not like me)
04:52 🔗 omf_ you mean mount it?
04:52 🔗 omf_ fuseiso
04:53 🔗 Coderjoe the ultimate goal is to mount a partition in the image without using loop or dm
04:53 🔗 omf_ fuseiso blah.iso dir/
04:53 🔗 Coderjoe not an iso
04:53 🔗 omf_ why not just create a chroot and mount it in there?
04:54 🔗 Coderjoe eh?
04:55 🔗 Coderjoe i have a rawharddriveimage.img file made using dd. the first 512 bytes contain an MBR partition table with one partition (in this case).
04:55 🔗 Coderjoe I want to use ntfs-3g on the partition in the image to access files in the image
04:56 🔗 Coderjoe without dding out the partition and without using loop devices and/or device mapper. (IE: without needing root)
04:57 🔗 Coderjoe (errata: the image is also compressed with bzip2 and being accessed through avfs at the moment)
04:59 🔗 omf_ Coderjoe, you tried guestmount from guestfs
04:59 🔗 omf_ no root required
04:59 🔗 Coderjoe yes
05:00 🔗 Coderjoe gave a stupid error that I cannot determine the cause of
05:01 🔗 Coderjoe also, it probes deeper than I really care for. (it determines filesystem types and everything)
05:21 🔗 arrith1 Coderjoe: yeah that site is pretty good for documenting disk recovery stuff/ ddrescue (aka gddrescue) vs dd_rescue for example
05:23 🔗 arrith1 Coderjoe: depending on a system's config, you can usually interact with loop stuff without root, at least on ubuntu. might be some group thing
05:23 🔗 arrith1 kpartx and losetup with offsets
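arrith1's losetup-with-offsets suggestion might look like the sketch below. The start sector (2048) and device names are assumptions; in practice you would read the start sector from `fdisk -l image.img`. ntfs-3g's `offset` mount option can seek into the raw image itself over FUSE, which matches Coderjoe's goal of avoiding loop devices:

```shell
# Compute the byte offset of a partition inside a raw dd image.
# MBR sector size is normally 512 bytes.
START_SECTOR=2048                # assumed; take it from `fdisk -l image.img`
OFFSET=$((START_SECTOR * 512))   # byte offset into the raw image

# Option 1: loop device with an offset (usually needs root/loop access)
#   sudo losetup -o "$OFFSET" /dev/loop0 image.img && sudo mount /dev/loop0 /mnt

# Option 2: let ntfs-3g seek into the image via FUSE, no loop device
#   ntfs-3g -o ro,offset="$OFFSET" image.img /mnt
```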
05:23 🔗 godane looks like the twinkies are coming back
05:23 🔗 godane july 15th i hear
05:27 🔗 arrith1 not in time for 4th of july aw
07:28 🔗 godane SketchCow: this is all you: http://www.atlasobscura.com/articles/object-of-intrigue-mickey-mouse-gas-mask
07:31 🔗 godane in that it's something to show at one of your speeches
07:39 🔗 ivan` omf_: nice, thanks
10:41 🔗 ivan` http://reader.aol.com/
10:42 🔗 ivan` they even implement a Reader-style API
10:42 🔗 joepie91 hah
10:42 🔗 joepie91 "so Google doesn't want to do Reader? fine, we'll do it then"
10:43 🔗 norbert79 too bad it refuses to work
10:44 🔗 norbert79 Doesn't do anything in FF 21
10:44 🔗 norbert79 Close, but no cigar
10:45 🔗 norbert79 on the other hand, Deep Space Nine - The fallen is a great UT 1 engine based game
10:46 🔗 norbert79 but on an unrelated note really
12:17 🔗 Smiley Anyone here good with building ec2 images?
12:17 🔗 * Smiley wants to make a xanga one
12:26 🔗 Smiley I honestly have no clue where to start
12:27 🔗 Smiley I'm thinking fire up a default debian install
12:27 🔗 Smiley then install the extra bits on top
12:36 🔗 ivan` why bother with an image if you can just use ssh to execute a setup script on each box
12:37 🔗 joepie91 so
12:37 🔗 joepie91 have we archived scribd yet
12:37 🔗 Smiley ivan`: ok, then i need a setup script ;)
12:37 🔗 Smiley I just need some automated way of firing up a load of instances.
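ivan`'s suggestion of a plain ssh loop instead of a baked image could be sketched like this (hostnames, key path, and script name are all placeholders; shown with a dry-run mode that only prints the commands):

```shell
# Push and run a setup script on each freshly launched instance.
# DRY_RUN=1 prints the commands instead of executing them.
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

setup_instances() {
  for host in "$@"; do
    run scp -i amazon.pem setup.sh "admin@$host:"
    run ssh -i amazon.pem "admin@$host" ./setup.sh
  done
}

setup_instances ec2-host-1 ec2-host-2
```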
12:38 🔗 ivan` joepie91: no, and +1 on that, there's a lot of good stuff there
12:40 🔗 Smiley considering we have 2 dying projects at the mo, you can feel free but don't expect much help.
12:41 🔗 joepie91 I really don't like Scribd :/
12:42 🔗 Smiley wtf why can't I ssh into these ec2 instances o_O
12:42 🔗 Smiley debug1: Authentications that can continue: publickey
12:42 🔗 Smiley debug1: Trying private key: ./.ssh/amazon.pem
12:42 🔗 Smiley debug1: read PEM private key done: type RSA
12:42 🔗 Smiley So it is reading mah key :/
12:43 🔗 norbert79 joepie91: I also wondered how a PHP and flash based webpage could be archived well with all functionalities and documents inside, especially document access requires an active username on scribd :)
12:43 🔗 Smiley eh, where have my ssh rules gone o_O
12:46 🔗 joepie91 norbert79: as I said, I don't like Scribd
12:47 🔗 norbert79 joepie91: I understand, I was just curious, as I really would like to see some solutions to such
12:48 🔗 norbert79 joepie91: one of the webpages I used to visit, all around former Luftwaffe (http://luftarchiv.de) was once my target for full dump, but I failed badly...
12:48 🔗 joepie91 obvious solution would be automated account creation and downloading
12:48 🔗 joepie91 but that'd probably require you to spend a few bucks on breaking captchas
12:49 🔗 norbert79 I wonder if any webpage would offer a sharing of their whole webpage for free, like a backup of the page running the stuff
12:49 🔗 joepie91 how do you mean?
12:50 🔗 norbert79 I mean like a webpage we would like to archive, based on PHP, would run on a server within a subdirectory, and we would ask kindly and be given the content of that directory
12:50 🔗 norbert79 that would be nice
12:50 🔗 norbert79 if webpage owners would be willing to share
12:50 🔗 norbert79 without the sensitive data of course
12:51 🔗 joepie91 I was actually thinking about that
12:51 🔗 joepie91 perhaps a framework should exist for sites to offer a 'backup' file
12:51 🔗 Smiley ok who uses the pipeline/seesaw stuff?
12:51 🔗 joepie91 in a standardized format
12:51 🔗 joepie91 something that doesn't take long to implement
12:52 🔗 norbert79 joepie91: Aye, the same thing I was thinking about too
12:52 🔗 norbert79 joepie91: Like I run a Wiki, I would be happy sharing that without the sensitive stuff, including the full dump, like the SQL dump
12:52 🔗 norbert79 but a method would be nice to exist to be able to put those packages to a VM image to make it work again
12:53 🔗 joepie91 not necessarily even a VM image
12:53 🔗 joepie91 just some standardized machine-readable format
12:53 🔗 norbert79 aye
12:53 🔗 norbert79 Just thinking about this
12:53 🔗 joepie91 speaking of which, mind speccing that out a bit?
12:53 🔗 joepie91 as to what kind of data you would need to store
12:53 🔗 joepie91 etc
12:53 🔗 joepie91 how it'd be structured
12:53 🔗 norbert79 hmm
12:53 🔗 norbert79 In my case I have a mediawiki running
12:53 🔗 joepie91 I'll have a think over it as well
12:53 🔗 norbert79 I have some specific changes
12:53 🔗 joepie91 perhaps combining the ideas it might yield a nice result
12:54 🔗 norbert79 like I use a captcha, but with a bash script running as a cronjob to make random pictures that replace the old captcha pictures
12:54 🔗 norbert79 I use almost the generic things, has short URL
12:54 🔗 norbert79 but runs within one subdirectory instead of /var/www
12:54 🔗 norbert79 so it's a bit modded
12:55 🔗 norbert79 but nothing serious
12:55 🔗 norbert79 joepie91: https://telehack.tk/wiki
12:56 🔗 norbert79 but a Wiki is an easier thing as there are already methods available
12:56 🔗 norbert79 the issue is with non-standard-cms engines
12:56 🔗 norbert79 like anything written manually
12:57 🔗 norbert79 there should be more API's existing
12:57 🔗 norbert79 for more common solutions and easier dumping
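The standardized site-backup file joepie91 and norbert79 sketch above might be a manifest bundled alongside the dumps; every field in this example is hypothetical, using norbert79's mediawiki setup for illustration:

```json
{
  "format_version": 1,
  "site": "https://telehack.tk/wiki",
  "engine": {"name": "mediawiki", "version": "unknown"},
  "dumps": {
    "database": "dump.sql.gz",
    "files": "uploads.tar.gz",
    "config": "LocalSettings.stripped.php"
  },
  "excluded": ["user passwords", "session data", "private config keys"]
}
```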
12:57 🔗 norbert79 but of course why would homepage owners wish to share their content this easy
18:49 🔗 godane i got a shit ton of maximum pc disks today
18:50 🔗 winr4r godane: excellent
19:04 🔗 godane i'm going to upload my cnn20 disk
19:06 🔗 Smiley metric or imperial shitton?
19:06 🔗 Smiley ;)
19:23 🔗 ersi "Facebook login | 100% anonymous!"
19:23 🔗 ersi lol'd
19:30 🔗 joepie91 where?
19:37 🔗 ersi doesn't really matter.. but in the corner to the right at https://trigd.com/
19:53 🔗 godane uploaded: https://archive.org/details/cdrom-cnn20
23:45 🔗 joepie91 http://cryto.net/~joepie91/manual.html
23:47 🔗 xmc interesting.