kisspunch: There are at least a couple IA staff here, I'm not sure if Somebody2 works there, but oftentimes someone will know the answer to a question even if they don't work at IA, or they'll redirect you to someone who would be more likely to know the answer.
kisspunch: I do not, but I hang around with people who do...
kisspunch: I guess to answer that question, you have to first answer another one: what do you want the person downloading the files to be able to do? Just see the content? Track changes between time frames? Recreate the exact experience a person would've had at a point in time? Something else?
kisspunch: I saw your earlier thing talking about what kind of thing you have - since it's code, this talk is probably along the lines of what you want: https://www.youtube.com/watch?v=Xx6Bb2sY4zo
kisspunch: it's basically an archive of everything on GitHub that has 10 stars or more, without using endless space
dashcloud: nice
I've moved to a new apartment. Massive connection, and actual heat, air conditioning, and a working bathroom. And drinkable water!
Will be more productive
MORE productive :O
Lot to do
Lot to make up
gigabit?
Let's not go crazy. 300mbit. Quite good.
cool
"How is internet in your area? I pay $27 for this crap. Supposed to be 500mbit. In Kyiv you can have 1gbit for 6 euro."
Russians complaining about their 380/500
fuckin
Right now I'd be jealous for having breathable air.
China? or Burbank?
Colorado Springs, actually.
oh
Question: If North Korea fixed this issue, then why do some domains still work? https://github.com/mandatoryprogrammer/NorthKoreaDNSLeak
hook54321: They fixed the leak, i.e. you can't get a list of domains through AXFR anymore.
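For context on the AXFR remark: the NorthKoreaDNSLeak data came from a DNS zone transfer, whose text output (as printed by e.g. `dig axfr`) lists every record's owner name. A minimal sketch of pulling domain names out of such output, assuming a hypothetical `extract_domains` helper and a canned sample (the records and addresses below are illustrative, not real leak data):

```python
import re

def extract_domains(zone_text):
    """Pull owner names out of zone-file text like `dig axfr` prints.

    Hypothetical helper: record lines start with a fully qualified
    owner name ending in a dot; comment lines start with ';'.
    """
    domains = set()
    for line in zone_text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        m = re.match(r"^(\S+\.)\s", line)
        if m:
            domains.add(m.group(1).rstrip("."))
    return sorted(domains)

# Canned sample resembling zone-transfer output (made-up data):
sample = """\
; <<>> DiG 9.10 <<>> axfr kp
kp.                 3600 IN SOA ns1.example.kp. root.example.kp. 1 2 3 4 5
airkoryo.com.kp.    3600 IN A   192.0.2.10
naenara.com.kp.     3600 IN A   192.0.2.11
"""
print(extract_domains(sample))
```

Closing the transfer (refusing AXFR to strangers) is exactly why this enumeration no longer works, while individual domains still resolve fine when queried by name.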
hey everyone
i'm on my slackware rpi distro i just built this morning
turned out part of my problem was the glibc-solibs script, it was not making the links; that was what was crashing the berryboot kernel
hey odemg
hey
i got slackware arm working
odemg: my plan is to make a librarybox+kiwix hybrid on slackware arm
hey, the other day I asked about scanning/submitting some old UK SF magazines (Interzone) and was advised to do 600dpi/TIFF. No problem. Any other advice or tips or URLs to read on scanning projects in general? Should I chop up the resulting TIFFs into sub-pages (each side is two separate pages from the publication), etc.?
slackpi, are you confusing me with someone else? this is the first I'm hearing of it
http://radio.garden/ - a nice distraction, at least
i have talked about it before on archiveteam-bs
at least i think i talked about it here
slackpi: o wait, are you godane
yes
kool
i'm on my raspberry pi 2
same room
yeah, i seem to recall having seen you mention radio.garden a while ago
thats the one with radio stations around the world
yeup
Jon: you might be interested in writing to the Internet Archive directly, as they routinely do book scanning, or just look at how they handle recently scanned books that are processed internally by them.
i'm now back on my main system for the moment :P
i'm at 3231 items for this month so far
i'm getting close to half of the items i had last month
Jon: yes, each TIFF should be a left- or a right-hand page, not both
name them 0001.tif, 0002.tif, etc., and put them in an archive named (whatever)_images.tar
you don't have to name them anything in particular so long as they sort correctly
Hey guys, I have a question about archiving something
ask away
I want to archive a couple of Disney website games, but they seem to be some sort of horrible multi-part / multi-file SWF files
godane: http://www.oldradioworld.com/media/ (via /r/opendirectories)
So not sure how to proceed
MartinThe: ah, the type that loads new files on demand as you click through the game?
I'm trying WarcMITMProxy, but the last commit is from 4 years ago and it looks like the dependencies broke big time. I'm running Ubuntu 16.04 LTS
joepie91_, Correct
ah yes, those are a pain, I don't think there's a bulletproof solution for those yet
MartinThe: link to this software?
https://github.com/odie5533/WarcMITMProxy
astrid, ^^ was linked to on the a-t.org wiki
MartinThe: afaik, your options are indeed either a warc proxy of some sort, or using a decompiler/converter that can take apart the SWFs and scripting your way around it
the former being theoretically easiest
joepie91_, Augh, decompiling is something I'd rather not do. A WARC proxy looks like the best option
Would webrecorder work? https://webrecorder.io/
hook54321, Not sure, the new downloads are triggered from the running SWF
hook54321, I presume webrecorder is basically a wget-type deal?
Have you tried warcprox? https://github.com/internetarchive/warcprox
joepie91: i'm going to be lazy and give it to archivebot
MrRadar, Looks cool, will check it out in a minute. Thanks a lot!
MartinThe: You enter a starting URL and then you browse stuff manually and it puts it all into a WARC
godane: heh.
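The page-naming advice above (one page per TIFF, numbered so they sort, packed into a `(whatever)_images.tar`) can be sketched as a short script. This is a minimal illustration, not an official IA ingestion tool; the filenames and the demo's empty placeholder files are made up:

```python
import os
import tarfile
import tempfile

def pack_scans(scan_dir, out_tar):
    """Rename page scans to 0001.tif, 0002.tif, ... (in sorted order)
    and pack them into a single tar archive, one page per entry."""
    pages = sorted(f for f in os.listdir(scan_dir)
                   if f.lower().endswith(".tif"))
    with tarfile.open(out_tar, "w") as tar:
        for i, name in enumerate(pages, start=1):
            tar.add(os.path.join(scan_dir, name), arcname="%04d.tif" % i)
    return len(pages)

# Demo with empty placeholder files standing in for real 600dpi scans:
with tempfile.TemporaryDirectory() as d:
    for name in ("page-left.tif", "page-right.tif"):
        open(os.path.join(d, name), "wb").close()
    out = os.path.join(d, "interzone001_images.tar")
    n = pack_scans(d, out)
    with tarfile.open(out) as tar:
        names = tar.getnames()
    print(n, names)
```

Zero-padding to four digits is what makes plain lexicographic sorting match page order, which is the only naming requirement mentioned above.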
just figured you might be interested in it given that you seem to do a lot of podcast/radio stuff :)
Oh heck, not just SWF. This thing's doing XML requests too.
Whoa. Yup, WARC looks like the only way.
MrRadar: Warcprox works fine
Glad I could help
arkiver: Did the imgh.us person reply? Also, did you contact them through the email address listed in whois or through the form on their site?
Anyone in here comfortable parsing XML?
SketchCow: what for? I would get my perl6 skills honed a bit, if the task seems like something I could handle. I think the motivation is enough to make me do it. Would take at least 12h though...
I'm going to do it a stupid way
Hold my avocado
Ha
Ok, just thought you might have some time to get it done.
Parse it with regex! :-)
I think he does, I guess that is the only stupid way.
I have a question about the wayback machine I'm not sure where else to pose
sun_shine: shoot
A historically important website I need for research purposes has been maliciously excluded
the domain is now owned by spammers who aren't interested in selling it. I'm not sure that the creators of the site can be contacted
is there anything I can do?
nope
sun_shine: Did you check if the creators of the site had an email listed in the whois for the domain?
this was back in 2009. is there anywhere i can look up historical whois stuff like that?
what's the site?
isaccorp.org
Seems to work fine for me. https://web.archive.org/web/*/http://isaccorp.com/
the site was at isaccorp.com until 2005, when it moved to isaccorp.org
oh
The site had enemies. I can't say for certain that the original owners weren't the ones who asked for it to be excluded, but it would be out of character.
And it seems like it was manually excluded rather than by robots.txt
There's a mirror of the wayback machine; it isn't up to date though. http://web.archive.bibalex.org/web/*/http://isaccorp.org
I'm gonna try to find a way to contact the previous owners.
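For the XML question above: the "stupid way" with regex usually backfires, and in Python the standard library already does the job. A minimal sketch, assuming a made-up item feed since the actual XML in question was never shown:

```python
import xml.etree.ElementTree as ET

# Made-up document standing in for whatever XML needed parsing:
doc = """<items>
  <item id="1"><title>First</title><url>http://example.org/1</url></item>
  <item id="2"><title>Second</title><url>http://example.org/2</url></item>
</items>"""

# Parse once, then walk the tree instead of pattern-matching raw text.
root = ET.fromstring(doc)
records = [
    (item.get("id"), item.findtext("title"), item.findtext("url"))
    for item in root.iter("item")
]
print(records)
```

Unlike a regex, the parser handles attribute order, whitespace, entity escaping, and nesting for free; the same few lines survive formatting changes that would break a hand-rolled pattern.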
wait, so does this show the captures that exist but currently aren't available?
Right now, someone in Ukraine named Andrey Ahiezer owns the domain.
Have you ever heard of a site being unexcluded? I know if the issue is robots.txt, then whoever controls the domain effectively controls its past availability as well
It shows the captures that existed at the time they mirrored the wayback machine.
But since it was manually excluded, I'm not sure that someone could override that even if, say, the present owner decided to
Could someone from Ukraine or someone named Andrey Ahiezer have been an enemy of the site? Really unlikely.
If the archive cuts off at 2007, though, that seems to suggest when the request for removal was sent
Or when they last updated the mirror
When a site is excluded manually they will still crawl it
oh, nevermind, they last updated the mirror in 2007 http://web.archive.bibalex.org/web/*/http://example.org
sun_shine: domaintools has whois history; it's not free though. https://whois.domaintools.com/isaccorp.org
you know, I have very rarely encountered 'domain excluded' errors when using wayback, and I'm a really heavy user
I just checked on two other defunct advocacy websites in the same area. Both excluded - and I know that the first one was purchased by the corporation it published exposes on after the owner died.
I think they bought the domains after they expired, had them excluded, and then dumped them
What are the other two domains? and the corporation
intrepidnetreporter.com and caica.org . The corporation that bought intrepidnetreporter is called WWASP and has a documented history of suing online critics. All three of these websites reported critically on them. https://en.wikipedia.org/wiki/World_Wide_Association_of_Specialty_Programs_and_Schools
I think I'm just going to write info@archive.org and ask nicely. I'm not sure there's any other option.
probably yeah
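One quick way to check whether a domain's captures are currently being served (before writing info@archive.org) is the Wayback Machine's availability API at `https://archive.org/wayback/available?url=...`. A minimal sketch of interpreting its JSON response; to stay runnable offline it parses canned payloads in the documented shape, with made-up sample values, rather than making a live request:

```python
import json

def closest_snapshot(payload):
    """Return (timestamp, url) for the closest available snapshot,
    or None when the API reports nothing, as it does for an
    excluded or never-captured domain."""
    data = json.loads(payload)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"], closest["url"]
    return None

# Canned responses in the shape the availability API returns:
excluded = '{"archived_snapshots": {}}'
ok = json.dumps({"archived_snapshots": {"closest": {
    "available": True, "status": "200",
    "timestamp": "20050101000000",
    "url": "http://web.archive.org/web/20050101000000/http://example.org/"}}})

print(closest_snapshot(excluded))
print(closest_snapshot(ok))
```

Note the API only says what is servable right now; an exclusion makes a domain look empty even when captures still exist behind the scenes, which is exactly the situation described above, and why asking the archive directly is the remaining option.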