[00:17] *** j08nY has quit IRC (Quit: Leaving)
[00:19] *** Ravenloft has joined #archiveteam-bs
[00:44] Going through old stuff, I have about 200 Offline Explorer backup files containing sites that no longer exist on the live web (I haven't looked at wayback/reocities yet or tried to track them down if they moved). Anyone want these?
[00:58] tsp_: the answer is always yes :) you can just upload them to an item on archive.org
[00:58] tsp_: ideally with as much information as you can remember about how you created them
[01:08] if you need any help with it tsp_ let me know
[01:12] the project for imzy is started
[01:13] the warrior?
[01:13] yes
[01:14] https://tracker.archiveteam.org/imzy/
[01:14] FOS is the target
[01:14] if it gets bogged down you can use jrwr.io
[01:14] thanks
[01:14] where's the github?
[01:14] will keep that in mind
[01:14] imzy-grab
[01:14] you have free use of that server until the 6th of the month
[01:14] then it expires
[01:14] sounds good
[01:14] wat
[01:14] Resource Limit Is Reached
[01:14] on the AT wiki
[01:15] *** wp494 has joined #archiveteam-bs
[01:17] Spinning up Virtual Machines
[01:17] greenie: we have started
[01:17] with archiving imzy
[01:17] 20 VMs about to land
[01:18] well, the site is a little slow
[01:18] they'll mostly be limited by the tracker
[01:18] * jrwr turns off 15 of them
[01:18] Is there a channel for Imzy?
[01:19] no
[01:19] we can all hang out in here :p
[01:19] What's the recommended concurrency?
[01:19] 6
[01:19] OK. I'll start there
[01:19] * arkiver is afk
[01:19] ping me if you see anything strange
[01:20] *** tfgbd_znc has quit IRC (Read error: Operation timed out)
[01:32] Thanks.
[01:34] jrwr: Can I upload them under one item? If so, what identifier do I use?
[01:34] opensource
[01:34] web
[01:34] and just use a name others can use to find it
[01:36] I don't know how people will find it, it's about 200 individual files with different sites in them.
[01:37] are they named well?
[01:37] the files
[01:37] like what domain they came off of
[01:37] Not really
[01:38] Hrm
[01:38] "Collection of archived websites"
[01:39] and just upload them
[01:39] * tsp_ nods
[01:39] any details you can put in the desc help
[01:39] I can extract a list of base URLs from most of them; there's a .dat inside the archives which can tell me where they came from. But it's more difficult to extract what settings I used to archive them.
[01:40] if you can prepare the list of domains/URLs and put them in the desc
[01:40] they will be indexed so if someone goes looking for it
[01:41] * tsp_ nods
[01:41] you can start the upload now if you want
[01:41] you can edit the desc after the fact
[01:42] Ok. How do I start uploading, the ia command line utility?
[01:42] there is
[01:42] Can I also rename and delete files after the fact? Some of these might not be dead.
[01:43] You can
[01:43] but I don't suggest it, since it's always good to have some kind of backup in case
[01:44] http://internetarchive.readthedocs.io/en/latest/cli.html
[01:44] the web interface does work pretty well
[01:45] I can also upload another 200 live ones ripped from the same time, I just separated out the dead ones a while ago.
[01:45] never hurts
[01:45] I don't think the web interface lets me upload hundreds of files in bulk, though.
[01:46] no, it does not like that
[01:46] the ia cli command does
[01:47] I still want to keep live and dead separated. I'll focus on the dead ones since they're the most important, then figure out what to do with the rest.
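A minimal sketch of the bulk upload being discussed, assuming the .boe files sit in the current directory; the identifier and metadata values are illustrative placeholders, not what tsp_ actually used:

    pip install internetarchive
    ia configure    # prompts once for archive.org credentials
    ia upload misc_website_rips *.boe \
        --metadata="mediatype:web" \
        --metadata="title:Collection of archived websites" \
        --metadata="description:Offline Explorer rips of sites now gone from the live web"

As jrwr notes above, the description can be edited on the item page after the fact.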
[01:47] Yep
[01:47] if you like saving sites, I do suggest warcprox
[01:48] https://github.com/internetarchive/warcprox
[01:48] I forgot about that. I download fanfiction sites sometimes in bulk, will it work for, say, 10 GB of data next run?
[01:48] if you are browsing websites
[01:49] that proxy saves it in a way that will flat out import into wayback machine
[01:50] I wonder if there is an effort to save onion sites to IA
[01:51] i don't know of one, but i would very much approve
[01:55] I would use a fork of archivebot
[01:55] since it has all the ignore lists
[02:04] I'll just call it MiscWebsiteRips or something.
[02:18] jrwr: I guess I want texts or data (probably data since these are archives) and not web, since they're not warc files?
[02:19] it's still from the web
[02:19] so it's best if you put them under web and community texts
[02:19] works the best really
[02:27] Oh, so ia upload MiscWebsiteRips *.boe --metadata=mediatype:web
[02:28] How would I get it under opensource?
[02:31] tsp_: it will default to being in the opensource collection, I think.
[02:35] Is there a preference to identifier names? MiscWebsiteRips vs misc_website_rips vs misc-website-rips?
[02:37] *** pizzaiolo has quit IRC (pizzaiolo)
[02:38] tsp_: not between those choices, no. Probably the underscore version is safest.
[02:38] as there are file systems without case-sensitivity, and shell tools that grumble at dashes.
[02:39] I'd personally recommend uploading each of the sites into *separate* items, rather than one, though.
[02:39] That way you can have separate metadata values for each site
[02:40] but uploading *anything* is much better than not.
[02:41] Somebody2: Can I split them out later? Right now they're kind of a mess, and I'd have to go through and figure out when things were downloaded, which ones have images, only partial sites, etc.
[02:42] Oh, there's an ia move command, so I can.
[02:42] i think that only moves files within items
[02:42] but i'm not sure
[02:43] You absolutely should upload them as one item if it would take work to separate them.
[02:43] tsp_:
[02:44] There's nothing at *all* wrong with immediately uploading a "raw" item with the pile-of-bits you currently have ...
[02:44] oh, huh, it might let you move files from one item to another!
[02:45] ... and then uploading cleaned-up, split-up items with better metadata later.
[02:45] If/when IA wants to de-duplicate the data they store, they can and will do so -- we uploaders should not worry about it
[02:46] (unless you are getting up into the tens of terabytes range)
[02:46] xmc: archivebot is "interesting" to setup
[02:47] heh i bet
[02:47] The dead collection is only 2.6 GB, so not much.
[02:48] tsp_: Yeah, at 3 GB, don't hesitate to upload a combined version now and split ones later.
[02:53] xmc: I'll have it run in #TorArchiveBot
[02:53] since I don't want to rewrite the commands so it can exist in #ArchiveBot
[02:53] sounds good to me!
[02:55] jrwr: wait, why are you running a copy of archivebot?
[02:55] I want to make a Tor version of it
[02:55] to archive tor websites and maybe other darknets
[03:06] arkiver: I'm getting an infinite sequence of HTTP 206 responses for a URL for Imzy: 87=206 https://www.imzy.com/api/accounts/profiles/mcnulty?check=true
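Picking up jrwr's warcprox suggestion above, a rough sketch of how it is typically run; the port, output directory, and prefix here are assumptions, and HTTPS capture additionally requires trusting the CA certificate warcprox generates:

    pip install warcprox
    warcprox -p 8000 -d ./warcs -z -n my_crawl
    # then point the browser's HTTP/HTTPS proxy at localhost:8000;
    # everything browsed gets written as gzipped WARCs under ./warcs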
[03:22] jrwr: cool. I'm a bit doubtful about how many darknet websites there are that are sufficiently public that I'd support archiving them into a public repository like archive.org
[03:22] but it's good to have available
[03:30] idk, archive everything on .onion that you can touch
[03:35] xmc: presumably not the entirety of facebook... :-)
[03:37] *** gui7 has quit IRC (Read error: Operation timed out)
[03:38] well, sure
[04:01] This is off-topic, but does anyone know what to do about an insect that is crawling around inside an LCD?
[04:01] Like behind the panel
[04:03] I've turned it off hoping it'll leave, dunno what else can be done. Probably nothing short of taking it apart :/
[04:53] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[04:58] *** Yurume has quit IRC (Remote host closed the connection)
[05:00] *** Sk1d has joined #archiveteam-bs
[05:07] *** Yurume has joined #archiveteam-bs
[05:16] Somebody2: I'm hearing comments which imply "something" happened regarding the robots.txt thing i mentioned earlier, and none of the IA staff are allowed to talk about it
[05:32] Lord_Nigh: what kind of something
[05:32] not sure. just comments from ia staff on twitter saying they can't talk about it, when asked flat out
[05:33] the faq was never updated to note that retroactive blocking is once again possible (and in fact expanded) either
[05:34] the reveal of this has been completely bungled and i'm not very pleased
[05:35] see https://twitter.com/TheMogMiner/status/873950228994502658
[05:36] themogminer got it wrong about the faq being revised; the faq was NEVER revised (or it wasn't 2 weeks ago, i didn't look since) and still implies that retroactive blocking isn't possible
[05:45] the faqs page looks like it was updated, but is even less clear about removal and retroactive stuff than before
[05:46] no, i'm wrong
[05:46] it's the same from 2 years ago:
[05:46] "How can I have my site's pages excluded from the Wayback Machine?
[05:46] You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message."
[05:46] before 2 years ago it was:
[05:47] "How can I have my site's pages excluded from the Wayback Machine?
[05:47] You can exclude your site from display in the Wayback Machine by placing a simple robots.txt file on your Web server.
[05:47] Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.
[05:47] If you are emailing to ask that your website not be archived, please note that you'll need to include the url (web address) in the text of your message."
[05:48] the "directions on how to automatically exclude your site" was a link to https://web.archive.org/web/20130606003203/http://archive.org/about/exclude.php
[05:48] faq from 6/2013 is https://web.archive.org/web/20130606003203/http://archive.org/about/faqs.php
[05:49] so the faq was never updated to restore the old verbiage, and the exclude.php is a 404 currently
[05:49] I thought they were making the robots.txt handling more permissive, not less
[05:49] at least that's what I got from people talking about it in here
[05:50] the info on exclude.php in the archive isn't valid anymore either, since archive.org respects a user agent of "*" now, it doesn't explicitly look for "ia_archiver" like it did 2 or 3 years ago
[05:50] (permissive meaning blocking fewer sites)
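For context, the two exclusion mechanisms being contrasted look roughly like this; both snippets are illustrative reconstructions, not text quoted from exclude.php:

    # old-style targeted exclusion, per the archived exclude.php directions;
    # only the Wayback crawler/display was addressed:
    User-agent: ia_archiver
    Disallow: /

    # a generic block aimed at all robots, which the Wayback Machine
    # reportedly now honours as well:
    User-agent: *
    Disallow: /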
[05:50] Frogging: exactly! this is really mysterious
[05:50] for 2 or 3 years it only respected the robots.txt that was present when the site was archived, and did not allow 'retroactive' blocking at all
[05:50] i.e. if it archived on June 2012 and there was a robots.txt on that day blocking "ia_archiver" it was blocked in the archiver
[05:50] er
[05:50] i.e. if it archived on June 2012 and there was a robots.txt on that day blocking "ia_archiver" it was blocked in the archive
[05:51] but if later or earlier the robots.txt didn't exist or didn't block "ia_archiver" it would happily let the site display
[05:52] the behavior now is if robots.txt blocks any sort of robot by "*" user agent (and probably also "ia_archiver", though I have not tested that), and that file is currently live, it will block EVERY ARCHIVED VERSION OF THAT SITE, EVER
[05:52] this is a major problem with malicious squatters, who have gone right back to abusing the system like they did 3 years ago
[05:54] so this is why i am majorly concerned, especially since the https://twitter.com/TheMogMiner/status/873950228994502658 text implies there may be a legal/court order preventing IA staff from talking about it, let alone fixing it
[05:54] yes
[05:54] i'm going to bed
[05:55] and since i haven't heard any noise on slashdot or ycombinator about this, i assume it is a sealed court order, or even possibly an NSL
[05:55] which is even creepier
[05:56] could even be Majestic 12 influence
[05:56] 🤔
[05:56] I'd be inferring something to do with legal system bustedness, that's for sure.
[05:58] lol that tweet
[05:58] really doubt it's an NSL, I'm sure there are channels for that and they do not involve robots.txt
[05:58] inferring conspiracies from "not allowed to comment"
[05:59] Warren Spector missed the mark on Deus Ex
[05:59] Yeah, I'm loving that they're inferring *anything* more involved than "something something legal system".
[06:00] (hell, it might not even be a sealed court order, just a court order we're ignorant of, and "not allowed to comment" is more "IA's lawyers would hate me for saying anything")
[06:01] (or a settlement, or, well. Anything, really. Not much to go off of.)
[06:01] sealed settlement would suck since that means it can never change again
[06:02] I'm probably reading way too much into those comments
[06:02] I'd not be confident reading much more into it than "there is probably a lawyer involved in this".
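One way to check whether the Wayback Machine will currently serve a site's snapshots is its availability API; example.com is a placeholder, and the exact response for excluded sites is an assumption based on the behaviour described above:

    # an excluded site should come back with an empty "archived_snapshots"
    # object even when captures exist:
    curl -s 'https://archive.org/wayback/available?url=example.com'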
[06:21] I started uploading misc_website_rips with ia, but it stopped here: error uploading The Fanfic Vault.boe: We encountered an internal error. Please try again. - uploadItem.py: uploading id = misc_website_rips using contribSubmit!
[06:22] I didn't see anything in the docs about what to do next, most importantly, will it resume if I run the same command again with the list of previously-uploaded files?
[06:59] *** SHODAN_UI has joined #archiveteam-bs
[07:04] *** schbirid has joined #archiveteam-bs
[07:50] *** j08nY has joined #archiveteam-bs
[08:11] *** BlueMaxim has quit IRC (Quit: Leaving)
[09:07] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[10:00] *** pizzaiolo has joined #archiveteam-bs
[10:39] *** j08nY has quit IRC (Read error: Operation timed out)
[11:22] *** SHODAN_UI has joined #archiveteam-bs
[12:02] *** j08nY has joined #archiveteam-bs
[12:51] *** dcmorton has quit IRC (Quit: ZNC - http://znc.in)
[12:56] *** dcmorton has joined #archiveteam-bs
[13:45] *** Odd0002 has quit IRC (Remote host closed the connection)
[15:20] tsp_: correct, it will resume
[15:22] "If you think the Archive is a bad actor you are 12 pineapples short of a luau"
[15:22] Word-Smithing at its finest
[15:35] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[15:46] *** BartoCH has joined #archiveteam-bs
[15:52] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[16:11] *** Pudsey has joined #archiveteam-bs
[16:27] tapedrive: if i could make a suggestion, archive.org/details/fanfictiondotnet_repack is a much better version than one giant tar file. that was my bad.
[16:30] I've actually downloaded that (the zip of the entire item) a few weeks ago, and I'll be integrating that into my archive and database - along with the 2012 scrapes - which will allow us to see how many stories we're actually missing overall.
[16:32] And also thanks for doing that scrape, bsmith093, it's been a lifesaver to me many times
[16:39] *** Pudsey has quit IRC (Remote host closed the connection)
[16:42] *** pizzaiolo has joined #archiveteam-bs
[16:48] *** tuluu has joined #archiveteam-bs
[16:59] *** tuluu has quit IRC (Ping timeout: 260 seconds)
[17:01] *** tuluu has joined #archiveteam-bs
[17:13] *** mls_ has quit IRC (Read error: Connection reset by peer)
[17:14] https://archive.org/details/godaneinbox?sort=-publicdate&&and[]=subject%3A%22Sports%20Illustrated%22
[17:14] *** mls_ has joined #archiveteam-bs
[17:25] bsmith093: Although you're right in that several smaller files are easier to handle, you can browse the contents of tar and zip archives on the IA and only download individual files from them. For example, https://archive.org/download/fanfictiondotnet_repack/Fanfiction_Q.zip/ (Don't attempt to do this with the larger zips though; my browser was not amused.)
[17:31] *** Smiley has joined #archiveteam-bs
[17:34] arkiver: It looks like the Imzy project doesn't work for the Warrior VM since the www.imzy.com domain requires TLS 1.2
[17:35] *** Simpbrain has quit IRC (Ping timeout: 506 seconds)
[17:37] ah...
[17:40] *** pizzaiolo has quit IRC (Quit: pizzaiolo)
[18:08] *** icedice has joined #archiveteam-bs
[18:09] jrwr: My item is unavailable due to content issues, not sure why. It looks like I'll have to email IA to find out why, maybe it didn't like one of the files.
[18:11] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[18:23] tsp_: You can replace /details/ with /history/ in the item's URL. Usually there is a reason in the logs.
[18:29] PurpleSym: I'm there (http://archive.org/history/misc_web_rips). Doesn't seem like there are any tasks, though I can't tell if "server readonly -- tasks waiting for harddrive fix" is a task or not.
[18:30] That item does not exist.
[18:32] You probably meant misc_website_rips: https://catalogd.archive.org/log/682094076
[18:32] Sorry, yeah, just noticed that.
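Two ia-CLI conveniences relevant to this stretch, sketched under the assumption that the original upload command is simply reused; --checksum makes a rerun skip files that already made it into the item:

    # resume the interrupted bulk upload, skipping completed files
    ia upload misc_website_rips *.boe --metadata="mediatype:web" --checksum
    # list the item's catalog tasks, the CLI counterpart of the /history/ trick
    ia tasks misc_website_rips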
[18:37] Antivirus got it. Is there anything I can do about it now that I know what the problem is?
[18:37] PurpleSym: Thanks, didn't know about history.
[18:37] Your best bet is probably to create a new item without the offending data
[18:42] MrRadar: With the understanding that any new sites I add might disable the new item due to malware?
[18:42] You might want to switch to 1 item per site then
[18:44] The biggest issue with that is coming up with identifiers.
[18:45] sitename_grabdate ?
[18:45] *** tuluu has quit IRC (Remote host closed the connection)
[18:47] *** tuluu has joined #archiveteam-bs
[18:47] That might work, if i can get grabdates out of these files.
[19:02] arkiver: We may want to reduce the Imzy rate limit a bit. I see tons of timeouts.
[19:02] And 504s
[19:22] Better now
[19:30] tapedrive: i also just uploaded this one recently https://archive.org/details/Fanfictiondotnet1011dump
[19:30] roughly 750k more stories, with a metadata db using the same schema
[19:30] bsmith093: I'm extracting that one now ;) (taking forever on a 256mb Raspberry Pi XD )
[19:31] Thanks for your great work!
[19:32] I'm also doing fictionpress and ao3 in that format, but as a simple scrape like the others. Redundancy is good, though, and i only started this because it was really easy to give fanficfare a giant list.
[19:33] I'm doing it in the same format as you to make it easy to merge them (which is what I'm doing now) and work out how many are actually missing
[19:34] tapedrive: to save some time, 130k of the first million stories on ffnet are all that's left of the oldest group.
[19:34] Yeah - I have a mysql database of all of them now
[19:35] Saying if they've been archived, deleted, or still need to be archived
[19:35] And a script that runs looking at fanfiction.net for new stories and updates, and updates the database accordingly.
[19:35] Scrapers query the database and get new stories to archive/rearchive
[19:42] I have a fanfiction.net db I've been collecting for a while, kind of an oddball format but it can be parsed. If someone can generate a list of story ids that are missing, I can go through and check which ones I have.
[19:43] In a few days (hopefully) I'll have a list of all the ones that I don't have (which will include the 2012 scrape and bsmith093's multiple scrapes)
[19:44] what's going on with fanfiction.net?
[19:44] Nothing, yet.
[19:45] But as far as I can find out the site's run by either a single person or a small team, that never reply to emails, tweets or comments.
[19:45] And every month the site has different problems, which sometimes get fixed, and more often don't.
[19:46] I mostly collect Harry Potter, but a few others got in as well.
[19:47] oh, yeah, I mean, it's good that it's being archived.
[19:47] I was just wondering if there was a proximate cause to discussion
[19:48] It's only come up because my crawl has finished recently. Nothing's happened with the site
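A hypothetical sketch of the tracking database tapedrive describes; the real schema is not public, so every name and type here is an assumption:

    # shell wrapper around MySQL, matching the workflow described above
    mysql fanfic <<'SQL'
    CREATE TABLE stories (
        story_id     BIGINT PRIMARY KEY,   -- fanfiction.net story id
        status       ENUM('archived','deleted','pending') NOT NULL,
        last_checked DATETIME
    );
    -- what a scraper would ask for: stories still needing archiving
    SELECT story_id FROM stories WHERE status = 'pending' LIMIT 100;
    SQL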
[19:50] *** medowar has joined #archiveteam-bs
[19:55] *** wp494_ has joined #archiveteam-bs
[19:55] *** luckcolor has quit IRC (Read error: Operation timed out)
[19:55] *** dboard has quit IRC (Read error: Operation timed out)
[19:55] *** luckcolor has joined #archiveteam-bs
[19:56] *** Petri152 has quit IRC (Read error: Operation timed out)
[19:56] *** dboard has joined #archiveteam-bs
[19:56] *** JAA has quit IRC (Read error: Operation timed out)
[19:56] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[19:56] *** ndiddy has quit IRC (Read error: Operation timed out)
[19:56] *** K4k has quit IRC (Read error: Operation timed out)
[19:56] *** mundus20- has quit IRC (Write error: Broken pipe)
[19:56] *** aschmitz has quit IRC (Read error: Operation timed out)
[19:56] *** mhazinsk has quit IRC (Read error: Operation timed out)
[19:56] *** jrwr has quit IRC (Read error: Operation timed out)
[19:56] *** will has quit IRC (Read error: Operation timed out)
[19:56] *** C4K3 has quit IRC (Read error: Operation timed out)
[19:56] *** ZexaronS has quit IRC (Read error: Operation timed out)
[19:57] *** Stilett0 has quit IRC (Read error: Operation timed out)
[19:57] *** Stilett0 has joined #archiveteam-bs
[19:57] *** K4k has joined #archiveteam-bs
[19:57] *** PotcFdk has quit IRC (Read error: Operation timed out)
[19:58] *** mundus201 has joined #archiveteam-bs
[19:58] *** rocode has quit IRC (Read error: Operation timed out)
[19:58] *** rocode has joined #archiveteam-bs
[19:58] *** jrwr has joined #archiveteam-bs
[19:58] *** ZexaronS has joined #archiveteam-bs
[19:58] *** ndiddy has joined #archiveteam-bs
[19:58] *** mhazinsk has joined #archiveteam-bs
[19:59] *** Lord_Nigh has joined #archiveteam-bs
[20:00] *** will has joined #archiveteam-bs
[20:00] *** wp494 has quit IRC (Read error: Operation timed out)
[20:01] *** PotcFdk has joined #archiveteam-bs
[20:02] *** JAA has joined #archiveteam-bs
[20:02] *** Petri152 has joined #archiveteam-bs
[20:03] *** C4K3 has joined #archiveteam-bs
[20:14] *** SHODAN_UI has joined #archiveteam-bs
[20:39] How should I turn a URL into an identifier? I can tell which page I started from, for example: http://home.att.net/~polliwog-press/pollistoryindex.htm
[20:41] If I do some manual editing, I can get it to home.att.net__polliwog-press_2009-12-12; however, I'm not sure if I should turn ~ into _ or remove it altogether to avoid the double _.
[20:43] archivebot turns it into an item named after just the hostname and the date of capture
[20:43] (and the pipeline)
[20:43] so you could call it archiveteam-ondemand_home.att.net_2016-06-12 or whatever
[20:44] There might be a few home.att.net rips in my collection, of different people. They're each in one .boe file.
[20:45] archivebot sometimes puts multiple crawl jobs in the same IA item, which isn't an issue
[20:46] *** th1x has joined #archiveteam-bs
[20:46] I tried to put all my backups in one item, but the antivirus scanner found something and killed the entire item instead of the one file.
[20:47] huh
[20:47] email info@archive.org and tell them it's a false positive
[20:47] Maybe it's not. But I'd prefer to have just that one file excluded rather than the entire thing.
[20:47] I'll email, thanks.
[20:52] xmc: But as someone else said, it might be a better idea to upload all of them as individual items. But I'm finding that kind of difficult to work with atm.
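On the identifier question above, one rough approach is to collapse the tilde and slashes into single underscores, sidestepping the double-underscore problem; this one-liner is a hypothetical helper (GNU sed assumed), not an established convention:

    url='http://home.att.net/~polliwog-press/pollistoryindex.htm'
    slug=$(printf '%s' "$url" | sed -e 's|^[a-z]*://||' -e 's|/[^/]*$||' -e 's|[~/]\+|_|g')
    echo "${slug}_2009-12-12"    # -> home.att.net_polliwog-press_2009-12-12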
[21:05] *** schbirid has quit IRC (Quit: Leaving)
[21:07] https://archive.org/about/faqs.php <- some of the content removal requests in the forum at the bottom of that page are weird... one page was about what looks like a foreign-language 'carfacts' equivalent site for a specific vehicle, which was not actually taken down...
[21:07] guessing someone was trying to hide some prior damage or something
[21:07] and the archive.org admins wisely decided to ignore the request
[21:09] huh, this might be related to the robots.txt thing: https://archive.org/post/1074464/robotstxt-processing-failure although that might just be 'collateral damage' from the recent change and not a bug?
[21:09] i.e. if www.example.com and example.com serve different robots.txt files, both will use the one from www.example.com
[22:04] *** SHODAN_UI has quit IRC (Remote host closed the connection)
[22:05] *** wp494_ is now known as wp494
[22:14] *** Gilfoyle has quit IRC (Read error: Operation timed out)
[22:54] *** BlueMaxim has joined #archiveteam-bs
[23:13] *** nyany has quit IRC (Remote host closed the connection)
[23:27] yipdw: you around by any chance
[23:28] *** nyany has joined #archiveteam-bs
[23:56] *** dashcloud has quit IRC (Read error: Operation timed out)
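Regarding the [21:09] note about mismatched robots.txt files, a quick way to spot the condition; example.com stands in for any affected site:

    # if these two differ, the forum post linked above suggests the Wayback
    # Machine applies the www version to both hostnames:
    curl -s http://example.com/robots.txt
    curl -s http://www.example.com/robots.txt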