#archiveteam-bs 2017-06-12,Mon


Time Nickname Message
00:17 πŸ”— j08nY has quit IRC (Quit: Leaving)
00:19 πŸ”— Ravenloft has joined #archiveteam-bs
00:44 πŸ”— tsp_ Going through old stuff, I have about 200 Offline Explorer backup files containing sites that no longer exist on the live web (I haven't looked at wayback/reocities yet or tried to track them down if they moved). Anyone want these?
00:58 πŸ”— joepie91 tsp_: the answer is always yes :) you can just upload them to an item on archive.org
00:58 πŸ”— joepie91 tsp_: ideally with as much information as you can remember about how you created them
01:08 πŸ”— jrwr if you need any help with it tsp_ let me know
01:12 πŸ”— arkiver the project for imzy is started
01:13 πŸ”— jrwr the warrior?
01:13 πŸ”— arkiver yes
01:14 πŸ”— arkiver https://tracker.archiveteam.org/imzy/
01:14 πŸ”— arkiver FOS is the target
01:14 πŸ”— jrwr if it gets bogged down you can use jrwr.io
01:14 πŸ”— arkiver thanks
01:14 πŸ”— jrwr wheres the github
01:14 πŸ”— arkiver will keep that in mind
01:14 πŸ”— arkiver imzy-grab
01:14 πŸ”— jrwr you have free use of that server until the 6th of the month
01:14 πŸ”— jrwr then it expires
01:14 πŸ”— arkiver sounds good
01:14 πŸ”— jrwr wat
01:14 πŸ”— jrwr Resource Limit Is Reached
01:14 πŸ”— jrwr on the AT wiki
01:15 πŸ”— wp494 has joined #archiveteam-bs
01:17 πŸ”— jrwr Spinning up Virtual Machines
01:17 πŸ”— arkiver greenie: we have started
01:17 πŸ”— arkiver with archiving imzy
01:17 πŸ”— jrwr 20 VMs about to land
01:18 πŸ”— arkiver well, the site is a little slow
01:18 πŸ”— arkiver they'll mostly be limited by the tracker
01:18 πŸ”— * jrwr turns off 15 of them
01:18 πŸ”— MrRadar Is there a channel for Imzy?
01:19 πŸ”— jrwr no
01:19 πŸ”— jrwr we can all hang out in here :p
01:19 πŸ”— MrRadar What's the recommended concurrent?
01:19 πŸ”— jrwr 6
01:19 πŸ”— MrRadar OK. I'll start there
01:19 πŸ”— * arkiver is afk
01:19 πŸ”— arkiver ping me if you see anything strange
01:20 πŸ”— tfgbd_znc has quit IRC (Read error: Operation timed out)
01:32 πŸ”— tsp_ Thanks.
01:34 πŸ”— tsp_ jrwr: Can I Upload them under one item? If so, what identifier do I use?
01:34 πŸ”— jrwr opensource
01:34 πŸ”— jrwr web
01:34 πŸ”— jrwr and just use a name others can use to find it
01:36 πŸ”— tsp_ I don't know how people will find it, it's about 200 individual files with different sites in them.
01:37 πŸ”— jrwr are they named well?
01:37 πŸ”— jrwr the files
01:37 πŸ”— jrwr like what domain they came off of
01:37 πŸ”— tsp_ Not really
01:38 πŸ”— jrwr Hrm
01:38 πŸ”— jrwr "Collection of archived websites"
01:39 πŸ”— jrwr and just upload them
01:39 πŸ”— * tsp_ nods
01:39 πŸ”— jrwr any details you can put in the desc helps
01:39 πŸ”— tsp_ I can extract a list of base URLs from most of them, there's a .dat inside the archives which can tell me where they came from. But it's more difficult to extract what settings I used to archive them.
01:40 πŸ”— jrwr if you can prepare the list of domains/URLs and put them in the desc
01:40 πŸ”— jrwr they will be indexed so if someone goes looking for it
01:41 πŸ”— * tsp_ nods
01:41 πŸ”— jrwr you can start the upload now if you want
01:41 πŸ”— jrwr you can edit the desc after the fact
01:42 πŸ”— tsp_ Ok. How do I start uploading, the ia command line utility?
01:42 πŸ”— jrwr there is
01:42 πŸ”— tsp_ Can I also rename and delete files after the fact? Some of these might not be dead.
01:43 πŸ”— jrwr You can
01:43 πŸ”— jrwr but I don't suggest it, since its always good to have some kind of backup in case
01:44 πŸ”— jrwr http://internetarchive.readthedocs.io/en/latest/cli.html
01:44 πŸ”— jrwr the web interface does work pretty well
01:45 πŸ”— tsp_ I can also upload another 200 live ones ripped from the same time, I just separated out the dead ones a while ago.
01:45 πŸ”— jrwr never hurts
01:45 πŸ”— tsp_ I don't think the web interface lets me upload hundreds of files in bulk, though.
01:46 πŸ”— jrwr no, it does not like that
01:46 πŸ”— jrwr the ia cli command does
01:47 πŸ”— tsp_ I still want to keep live and dead separated, I'll focus on the dead ones since they're the most important then figure out what to do with the rest.
01:47 πŸ”— jrwr Yep
01:47 πŸ”— jrwr if you like saving sites, I do suggest warcprox
01:48 πŸ”— jrwr https://github.com/internetarchive/warcprox
01:48 πŸ”— tsp_ I forgot about that. I download fanfiction sites sometimes in bulk, will it work for, say, 10 GB of data next run?
01:48 πŸ”— jrwr if you are browsing websites
01:49 πŸ”— jrwr that proxy saves it in a way that will flat out import into wayback machine
01:50 πŸ”— jrwr I wonder if there is a effort to save onion sites to IA
01:51 πŸ”— xmc i don't know of one, but i would very much approve
01:55 πŸ”— jrwr I would use a fork of archivebot
01:55 πŸ”— jrwr since it has all the ignore lists
02:04 πŸ”— tsp_ I'll just call it MiscWebsiteRips or something.
02:18 πŸ”— tsp_ jrwr: I guess I want texts or data (probably data since these are archives) and not web, since they're not warc files?
02:19 πŸ”— jrwr its still from the web
02:19 πŸ”— jrwr so its best if you put them under web and community texts
02:19 πŸ”— jrwr works the best really
02:27 πŸ”— tsp_ Oh, so ia upload MiscWebsiteRips *.boe --metadata=mediatype:web
02:28 πŸ”— tsp_ How would I get it under opensource?
02:31 πŸ”— Somebody2 tsp_: it will default to being in the opensource collection, I think.
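Editor's sketch of the upload command discussed above. It only assembles the argument list for `ia upload` and prints it; nothing is uploaded. The helper name `build_ia_upload_command` is hypothetical, and passing `collection:opensource` explicitly is an assumption based on jrwr's suggestion (Somebody2 thinks it is the default anyway).

```python
import glob
import shlex


def build_ia_upload_command(identifier, files, metadata):
    """Return the argv list for an `ia upload` call without running it."""
    command = ["ia", "upload", identifier]
    command.extend(files)
    # Each metadata pair becomes one --metadata=key:value option,
    # the same form tsp_ used for mediatype.
    for key, value in metadata.items():
        command.append(f"--metadata={key}:{value}")
    return command


if __name__ == "__main__":
    # Collect the Offline Explorer backups in the current directory.
    boe_files = sorted(glob.glob("*.boe"))
    argv = build_ia_upload_command(
        "misc_website_rips",
        boe_files,
        {"mediatype": "web", "collection": "opensource"},
    )
    # Print a shell-quoted command so file names with spaces stay intact.
    print(shlex.join(argv))
```

Building the list in Python rather than relying on shell globbing keeps names such as "The Fanfic Vault.boe" as single arguments.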
02:35 πŸ”— tsp_ Is there a preference to identifier names? MiscWebsiteRips vs misc_website_rips vs misc-website-rips?
02:37 πŸ”— pizzaiolo has quit IRC (pizzaiolo)
02:38 πŸ”— Somebody2 tsp_: not between those choices, no. Probably the underscore version is safest.
02:38 πŸ”— Somebody2 as there are file systems without case-sensitivity, and shell tools that grumble at dashes.
02:39 πŸ”— Somebody2 I'd personally recommend uploading each of the sites into *separate* items, rather than one, though.
02:39 πŸ”— Somebody2 That way you can have separate metadata values for each site
02:40 πŸ”— Somebody2 but uploading *anything* is much better than not.
02:41 πŸ”— tsp_ Somebody2: Can I split them out later? Right now they're kind of a mess, and I'd have to go through and figure out when things were downloaded, which ones have images, only partial sites, etc
02:42 πŸ”— tsp_ Oh, there's an ia move command, so I can.
02:42 πŸ”— xmc i think that only moves files within items
02:42 πŸ”— xmc but i'm not sure
02:43 πŸ”— Somebody2 You absolutely should upload them as one item if it would take work to separate them.
02:43 πŸ”— Somebody2 tsp_:
02:44 πŸ”— Somebody2 There's nothing at *all* wrong with immediately uploading a "raw" item with the pile-of-bits you currently have ...
02:44 πŸ”— xmc oh, huh, it might let you move files from one item to another!
02:45 πŸ”— Somebody2 ... and then later uploading cleaned up, split up items with better metadata later.
02:45 πŸ”— Somebody2 If/when IA wants to de-duplicate the data they store, they can and will do so -- we uploaders should not worry about it
02:46 πŸ”— Somebody2 (unless you are getting up into the tens of terabytes range)
02:46 πŸ”— jrwr xmc: archivebot is "interesting" to setup
02:47 πŸ”— xmc heh i bet
02:47 πŸ”— tsp_ The dead collection is only 2.6 GB, so not much.
02:48 πŸ”— Somebody2 tsp_: Yeah, at 3 GB, don't hesitate to upload a combined version now and split ones later.
02:53 πŸ”— jrwr xmc: Ill have it run in #TorArchiveBot
02:53 πŸ”— jrwr since I don't want to rewrite the commands so it can exist in #ArchiveBot
02:53 πŸ”— xmc sounds good to me!
02:55 πŸ”— Somebody2 jrwr: wait, why are you running a copy of archivebot?
02:55 πŸ”— jrwr I want to make a Tor version of it
02:55 πŸ”— jrwr to archive tor websites and maybe other darknets
03:06 πŸ”— MrRadar arkiver: I'm getting an infinite sequence of HTTP 206 responses for a URL for Imzy: 87=206 https://www.imzy.com/api/accounts/profiles/mcnulty?check=true
03:22 πŸ”— Somebody2 jrwr: cool. I'm a bit doubtful about how many darknet websites there are that are sufficiently public that I'd support archiving them into a public repository like archive.org
03:22 πŸ”— Somebody2 but it's good to have available
03:30 πŸ”— xmc idk, archive everything on .onion that you can touch
03:35 πŸ”— Somebody2 xmc: presumably not the entirety of facebook... :-)
03:37 πŸ”— gui7 has quit IRC (Read error: Operation timed out)
03:38 πŸ”— xmc well, sure
04:01 πŸ”— Frogging This is off-topic, but does anyone know what to do about an insect that is crawling around inside an LCD
04:01 πŸ”— Frogging Like behind the panel
04:03 πŸ”— Frogging I've turned it off hoping it'll leave, dunno what else can be done. Probably nothing short of taking it apart :/
04:53 πŸ”— Sk1d has quit IRC (Ping timeout: 250 seconds)
04:58 πŸ”— Yurume has quit IRC (Remote host closed the connection)
05:00 πŸ”— Sk1d has joined #archiveteam-bs
05:07 πŸ”— Yurume has joined #archiveteam-bs
05:16 πŸ”— Lord_Nigh Somebody2: I'm hearing comments which imply "something" happened regarding the robots.txt thing i mentioned earlier, and none of the IA staff are allowed to talk about it
05:32 πŸ”— xmc Lord_Nigh: what kind of something
05:32 πŸ”— Lord_Nigh not sure. just comments from ia staff on twitter saying they can't talk about it, when asked flat out
05:33 πŸ”— Lord_Nigh the faq was never updated to note that retroactive blocking is once again possible (and in fact expanded) either
05:34 πŸ”— xmc the reveal of this has been completely bungled and i'm not very pleased
05:35 πŸ”— Lord_Nigh see https://twitter.com/TheMogMiner/status/873950228994502658
05:36 πŸ”— Lord_Nigh themogminer got it wrong about the faq being revised; the faq was NEVER revised (or it wasn't 2 weeks ago, i didn't look since) and still implies that retroactive blocking isn't possible
05:45 πŸ”— Lord_Nigh the faqs page looks like it was updated, but is even less clear about removal and retroactive stuff than before
05:46 πŸ”— Lord_Nigh no, i'm wrong
05:46 πŸ”— Lord_Nigh its the same from 2 years ago:
05:46 πŸ”— Lord_Nigh "How can I have my site's pages excluded from the Wayback Machine?
05:46 πŸ”— Lord_Nigh You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message. "
05:46 πŸ”— Lord_Nigh before 2 years ago it was:
05:47 πŸ”— Lord_Nigh "How can I have my site's pages excluded from the Wayback Machine?
05:47 πŸ”— Lord_Nigh You can exclude your site from display in the Wayback Machine by placing a simple robots.txt file on your Web server.
05:47 πŸ”— Lord_Nigh Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.
05:47 πŸ”— Lord_Nigh If you are emailing to ask that your website not be archived, please note that you'll need to include the url (web address) in the text of your message. "
05:48 πŸ”— Lord_Nigh the "directions on how to automatically exclude your site" was a link to https://web.archive.org/web/20130606003203/http://archive.org/about/exclude.php
05:48 πŸ”— Lord_Nigh faq from 6/2013 is https://web.archive.org/web/20130606003203/http://archive.org/about/faqs.php
05:49 πŸ”— Lord_Nigh so the faq was never updated to restore the old verbiage, and the exclude.php is a 404 currently
05:49 πŸ”— Frogging I thought they were making the robots.txt handling more permissive, not less
05:49 πŸ”— Frogging at least that's what I got from people talking about it in here
05:50 πŸ”— Lord_Nigh the info on exclude.php in the archive isn't valid anymore either, since archive.org respects user agent of "*" now, it doesn't explicitly look for "ia_archiver" like it did 2 or 3 years ago
05:50 πŸ”— Frogging (permissive meaning blocking fewer sites)
05:50 πŸ”— Lord_Nigh Frogging: exactly! this is really mysterious
05:50 πŸ”— Lord_Nigh for 2 or 3 years it only respected the robots.txt that was present when the site was archived, and did not allow 'retroactive' blocking at all
05:50 πŸ”— Lord_Nigh i.e. if it archived on june 2012 and there was a robots.txt on that day blocking "ia_archiver" it was blocked in the archiver
05:50 πŸ”— Lord_Nigh er
05:50 πŸ”— Lord_Nigh i.e. if it archived on june 2012 and there was a robots.txt on that day blocking "ia_archiver" it was blocked in the archive
05:51 πŸ”— Lord_Nigh but if later or earlier the robots.txt didn't exist or didnt block "ia_archiver" it would happily let the site display
05:52 πŸ”— Lord_Nigh the behavior now is if robots.txt blocks any sort of robot by "*" user agent (and probably also "ia_archiver" though I have not tested that), and that file is currently live, it will block EVERY ARCHIVED VERSION OF THAT SITE, EVER
05:52 πŸ”— Lord_Nigh this is a major problem with malicious squatters, who have gone right back to abusing the system like they did 3 years ago
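Editor's sketch of the two behaviours Lord_Nigh contrasts. This is a toy model of what is described in the chat, not the Wayback Machine's actual code. The function names and the `retroactive` flag are hypothetical, and the model ignores the separate question of whether only "ia_archiver" or also "*" is honoured.

```python
from urllib.robotparser import RobotFileParser


def blocks_site(robots_txt, user_agent="ia_archiver"):
    """True if this robots.txt text disallows the site for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "http://example.com/page")


def capture_visible(capture_robots, live_robots, retroactive):
    """Decide whether one archived capture would be displayed.

    retroactive=False: only the robots.txt seen at capture time matters.
    retroactive=True: the robots.txt that is live today decides for
    every capture, old or new.
    """
    governing = live_robots if retroactive else capture_robots
    return not blocks_site(governing)


if __name__ == "__main__":
    original_site = ""  # no rules when the site was captured
    squatter = "User-agent: *\nDisallow: /\n"  # rules served today
    print(capture_visible(original_site, squatter, retroactive=False))
    print(capture_visible(original_site, squatter, retroactive=True))
```

Under the retroactive policy, a squatter who takes over a lapsed domain and serves a blanket `Disallow: /` hides captures made long before they owned it, which is the abuse described above.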
05:54 πŸ”— Lord_Nigh so this is why i am majorly concerned, especially since the https://twitter.com/TheMogMiner/status/873950228994502658 text implies there may be a legal/court order preventing IA staff from talking about it, let alone fixing it
05:54 πŸ”— xmc yes
05:54 πŸ”— xmc i'm going to bed
05:55 πŸ”— Lord_Nigh and since i haven't heard any noise on slashdot or ycombinator about this, i assume it is a sealed court order, or even possibly an NSL
05:55 πŸ”— Lord_Nigh which is even creepier
05:56 πŸ”— yipdw could even be Majestic 12 influence
05:56 πŸ”— yipdw 🤔
05:56 πŸ”— pikhq I'd be inferring something to do with legal system bustedness, that's for sure.
05:58 πŸ”— yipdw lol that tweet
05:58 πŸ”— Frogging really doubt it's an NSL, I'm sure there are channels for that and they do not involve robots.txt
05:58 πŸ”— yipdw inferring conspiracies from "not allowed to comment"
05:59 πŸ”— yipdw Warren Spector missed the mark on Deus Ex
05:59 πŸ”— pikhq Yeah, I'm loving that they're inferring *anything* more involved than "something something legal system".
06:00 πŸ”— pikhq (hell, it might not even be a sealed court order, just a court order we're ignorant of, and "not allowed to comment" is more "IA's lawyers would hate me for saying anything")
06:01 πŸ”— pikhq (or a settlement, or, well. Anything, really. Not much to go off of.)
06:01 πŸ”— Lord_Nigh sealed settlement would suck since that means it can never change again
06:02 πŸ”— Lord_Nigh I'm probably reading way too much into those comments
06:02 πŸ”— pikhq I'd not be confident reading much more into it than "there is probably a lawyer involved in this".
06:21 πŸ”— tsp_ I started uploading misc_website_rips with ia, but it stopped here: error uploading The Fanfic Vault.boe: We encountered an internal error. Please try again. - uploadItem.py: uploading id = misc_website_rips using contribSubmit!
06:22 πŸ”— tsp_ I didn't see anything in the docs about what to do next, most importantly, will it resume if I run the same command again with the list of previously-uploaded files?
06:59 πŸ”— SHODAN_UI has joined #archiveteam-bs
07:04 πŸ”— schbirid has joined #archiveteam-bs
07:50 πŸ”— j08nY has joined #archiveteam-bs
08:11 πŸ”— BlueMaxim has quit IRC (Quit: Leaving)
09:07 πŸ”— SHODAN_UI has quit IRC (Remote host closed the connection)
10:00 πŸ”— pizzaiolo has joined #archiveteam-bs
10:39 πŸ”— j08nY has quit IRC (Read error: Operation timed out)
11:22 πŸ”— SHODAN_UI has joined #archiveteam-bs
12:02 πŸ”— j08nY has joined #archiveteam-bs
12:51 πŸ”— dcmorton has quit IRC (Quit: ZNC - http://znc.in)
12:56 πŸ”— dcmorton has joined #archiveteam-bs
13:45 πŸ”— Odd0002 has quit IRC (Remote host closed the connection)
15:20 πŸ”— jrwr tsp_: correct it will resume
15:22 πŸ”— jrwr "If you think the Archive is a bad actor you are 12 pineapples short of a luau"
15:22 πŸ”— jrwr Word-Smithing at its finest
15:35 πŸ”— BartoCH has quit IRC (Ping timeout: 260 seconds)
15:46 πŸ”— BartoCH has joined #archiveteam-bs
15:52 πŸ”— pizzaiolo has quit IRC (Quit: pizzaiolo)
16:11 πŸ”— Pudsey has joined #archiveteam-bs
16:27 πŸ”— bsmith093 tapedrive: if i could make a suggestion, archive.org/details/fanfictiondotnet_repack much better version than one giant tar file. that was my bad.
16:30 πŸ”— tapedrive I've actually downloaded that (the zip of the entire item) a few weeks ago, and I'll be integrating that into my archive and database - along with the 2012 scrapes - which will allow us to see how many stories we're actually missing overall.
16:32 πŸ”— tapedrive And also thanks for doing that scrape, bsmith093, it's been a lifesaver to me many times
16:39 πŸ”— Pudsey has quit IRC (Remote host closed the connection)
16:42 πŸ”— pizzaiolo has joined #archiveteam-bs
16:48 πŸ”— tuluu has joined #archiveteam-bs
16:59 πŸ”— tuluu has quit IRC (Ping timeout: 260 seconds)
17:01 πŸ”— tuluu has joined #archiveteam-bs
17:13 πŸ”— mls_ has quit IRC (Read error: Connection reset by peer)
17:14 πŸ”— godane https://archive.org/details/godaneinbox?sort=-publicdate&&and[]=subject%3A%22Sports%20Illustrated%22
17:14 πŸ”— mls_ has joined #archiveteam-bs
17:25 πŸ”— JAA bsmith093: Although you're right in that several smaller files are easier to handle, you can browse the contents of tar and zip archives on the IA and only download individual files from it. For example, https://archive.org/download/fanfictiondotnet_repack/Fanfiction_Q.zip/ (Don't attempt to do this with the larger zips though; my browser was not amused.)
17:31 πŸ”— Smiley has joined #archiveteam-bs
17:34 πŸ”— MrRadar arkiver: It looks like the Imzy project doesn't work for the Warrior VM since the www.imzy.com domain requires TLS 1.2
17:35 πŸ”— Simpbrain has quit IRC (Ping timeout: 506 seconds)
17:37 πŸ”— arkiver ah...
17:40 πŸ”— pizzaiolo has quit IRC (Quit: pizzaiolo)
18:08 πŸ”— icedice has joined #archiveteam-bs
18:09 πŸ”— tsp_ jrwr: My item is unavailable due to content issues, not sure why. It looks like i'll have to email IA to find out why, maybe it didn't like one of the files.
18:11 πŸ”— SHODAN_UI has quit IRC (Remote host closed the connection)
18:23 πŸ”— PurpleSym tsp_: You can replace /details/ with /history/ in the item's URL. Usually there is a reason in the logs.
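Editor's sketch of PurpleSym's tip as a one-line helper. The function name `history_url` is hypothetical; the URL rewrite itself is exactly the substitution described above.

```python
def history_url(details_url):
    """Turn an item's /details/ URL into its /history/ (task log) URL."""
    # Replace only the first occurrence so an identifier that happens to
    # contain the text "details" is left alone.
    return details_url.replace("/details/", "/history/", 1)


if __name__ == "__main__":
    print(history_url("https://archive.org/details/misc_website_rips"))
```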
18:29 πŸ”— tsp_ PurpleSym: I'm there (http://archive.org/history/misc_web_rips). Doesn't seem like there are any tasks, though I can't tell if "server readonly -- tasks waiting for harddrive fix" is a task or not.
18:30 πŸ”— PurpleSym That item does not exist.
18:32 πŸ”— PurpleSym You probably meant misc_website_rips: https://catalogd.archive.org/log/682094076
18:32 πŸ”— tsp_ Sorry, yeah, just noticed that.
18:37 πŸ”— tsp_ Antivirus got it. Is there anything I can do about it now that I know what the problem is?
18:37 πŸ”— tsp_ PurpleSym: Thanks, didn't know about history.
18:37 πŸ”— MrRadar Your best bet is probably to create a new item without the offending data
18:42 πŸ”— tsp_ MrRadar: With the understanding that any new sites I add might disable the new item due to malware?
18:42 πŸ”— MrRadar You might want to switch to 1 item per site then
18:44 πŸ”— tsp_ The biggest issue with that is coming up with identifiers.
18:45 πŸ”— MrRadar sitename_grabdate ?
18:45 πŸ”— tuluu has quit IRC (Remote host closed the connection)
18:47 πŸ”— tuluu has joined #archiveteam-bs
18:47 πŸ”— tsp_ That might work, if i can get grabdates out of these files.
19:02 πŸ”— JAA arkiver: We may want to reduce the Imzy rate limit a bit. I see tons of timeouts.
19:02 πŸ”— JAA And 504s
19:22 πŸ”— JAA Better now
19:30 πŸ”— bsmith093 tapedrive: i also just uploaded this one recently https://archive.org/details/Fanfictiondotnet1011dump
19:30 πŸ”— bsmith093 roughly 750k more stories, with a metadata db using the same schema
19:30 πŸ”— tapedrive bsmith093: I'm extracting that one now ;) (taking forever on a 256mb Raspberry Pi XD )
19:31 πŸ”— tapedrive Thanks for your great work!
19:32 πŸ”— bsmith093 im also doing fictionpress and ao3 in that format, but as a simple scrape like the others. Redundancy is good, though, and i only started this because it was really easy to give fanficfare a giant list.
19:33 πŸ”— tapedrive I'm doing it in the same format as you to make it easy to merge them (which is what I'm doing now) and work out how many are actually missing
19:34 πŸ”— bsmith093 tapedrive: to save some time, 130k of the first million stories on ffnet are all thats left of the oldest group.
19:34 πŸ”— tapedrive Yeah - I have a mysql database of all of them now
19:35 πŸ”— tapedrive Saying if they've been archived, deleted, or still need to be archived
19:35 πŸ”— tapedrive And a script that runs looking at fanfiction.net for new stories and updates, and updates the database accordingly.
19:35 πŸ”— tapedrive Scrapers query the database and get new stories to archive/rearchive
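Editor's sketch of the kind of tracking database tapedrive describes: one row per story with a status, and a query that scrapers use to pick up work. tapedrive uses MySQL; this uses Python's built-in sqlite3 so it runs anywhere. The table layout, the status names, and every function name here are assumptions for illustration, not tapedrive's actual schema.

```python
import sqlite3


def open_tracker(path=":memory:"):
    """Open the tracking database and create the stories table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        " story_id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL"
        " CHECK (status IN ('archived', 'deleted', 'pending')))"
    )
    return conn


def record_story(conn, story_id, status):
    """Insert a story, or overwrite its status if it is already tracked."""
    conn.execute(
        "INSERT OR REPLACE INTO stories (story_id, status) VALUES (?, ?)",
        (story_id, status),
    )
    conn.commit()


def next_batch(conn, limit=100):
    """Return up to `limit` story ids that still need to be archived."""
    rows = conn.execute(
        "SELECT story_id FROM stories WHERE status = 'pending'"
        " ORDER BY story_id LIMIT ?",
        (limit,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_tracker()
    record_story(conn, 1, "archived")
    record_story(conn, 2, "pending")
    record_story(conn, 3, "deleted")
    print(next_batch(conn))
```

A monitoring script would call `record_story(..., "pending")` when it sees a new or updated story, and a scraper would mark the story "archived" once it has been saved.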
19:42 πŸ”— tsp_ I have a fanfiction.net db I've been collecting for a while, kind of an oddball format but it can be parsed. If someone can generate a list of story ids that are missing, I can go through and check which ones I have.
19:43 πŸ”— tapedrive In a few days (hopefully) I'll have a list of all the ones that I don't have (which will include the 2012 scrape and bsmith093's multiple scrapes)
19:44 πŸ”— nightpool what's going on with fanfiction.net?
19:44 πŸ”— tapedrive Nothing, yet.
19:45 πŸ”— tapedrive But as far as I can find out the site's run by either a single person or a small team, that never reply to emails, tweets or comments.
19:45 πŸ”— tapedrive And every month the site has different problems, which sometimes get fixed, and more often don't.
19:46 πŸ”— tsp_ I mostly collect Harry Potter, but a few others got in as well.
19:47 πŸ”— nightpool oh, yeah, I mean, it's good that it's being archived.
19:47 πŸ”— nightpool I was just wondering if there was a proximate cause to discussion
19:48 πŸ”— tapedrive It's only come up because my crawl has finished recently. Nothing's happened with the site
19:50 πŸ”— medowar has joined #archiveteam-bs
19:55 πŸ”— wp494_ has joined #archiveteam-bs
19:55 πŸ”— luckcolor has quit IRC (Read error: Operation timed out)
19:55 πŸ”— dboard has quit IRC (Read error: Operation timed out)
19:55 πŸ”— luckcolor has joined #archiveteam-bs
19:56 πŸ”— Petri152 has quit IRC (Read error: Operation timed out)
19:56 πŸ”— dboard has joined #archiveteam-bs
19:56 πŸ”— JAA has quit IRC (Read error: Operation timed out)
19:56 πŸ”— Lord_Nigh has quit IRC (Read error: Operation timed out)
19:56 πŸ”— ndiddy has quit IRC (Read error: Operation timed out)
19:56 πŸ”— K4k has quit IRC (Read error: Operation timed out)
19:56 πŸ”— mundus20- has quit IRC (Write error: Broken pipe)
19:56 πŸ”— aschmitz has quit IRC (Read error: Operation timed out)
19:56 πŸ”— mhazinsk has quit IRC (Read error: Operation timed out)
19:56 πŸ”— jrwr has quit IRC (Read error: Operation timed out)
19:56 πŸ”— will has quit IRC (Read error: Operation timed out)
19:56 πŸ”— C4K3 has quit IRC (Read error: Operation timed out)
19:56 πŸ”— ZexaronS has quit IRC (Read error: Operation timed out)
19:57 πŸ”— Stilett0 has quit IRC (Read error: Operation timed out)
19:57 πŸ”— Stilett0 has joined #archiveteam-bs
19:57 πŸ”— K4k has joined #archiveteam-bs
19:57 πŸ”— PotcFdk has quit IRC (Read error: Operation timed out)
19:58 πŸ”— mundus201 has joined #archiveteam-bs
19:58 πŸ”— rocode has quit IRC (Read error: Operation timed out)
19:58 πŸ”— rocode has joined #archiveteam-bs
19:58 πŸ”— jrwr has joined #archiveteam-bs
19:58 πŸ”— ZexaronS has joined #archiveteam-bs
19:58 πŸ”— ndiddy has joined #archiveteam-bs
19:58 πŸ”— mhazinsk has joined #archiveteam-bs
19:59 πŸ”— Lord_Nigh has joined #archiveteam-bs
20:00 πŸ”— will has joined #archiveteam-bs
20:00 πŸ”— wp494 has quit IRC (Read error: Operation timed out)
20:01 πŸ”— PotcFdk has joined #archiveteam-bs
20:02 πŸ”— JAA has joined #archiveteam-bs
20:02 πŸ”— Petri152 has joined #archiveteam-bs
20:03 πŸ”— C4K3 has joined #archiveteam-bs
20:14 πŸ”— SHODAN_UI has joined #archiveteam-bs
20:39 πŸ”— tsp_ How should I turn a URL into an identifier? I can tell which page I started from, for example: http://home.att.net/~polliwog-press/pollistoryindex.htm
20:41 πŸ”— tsp_ If I do some manual editing, I can get it to home.att.net__polliwog-press_2009-12-12, however I'm not sure if I should turn ~ into _ or remove it altogether to avoid the double _.
20:43 πŸ”— xmc archivebot turns it into an item named after just the hostname and the date of capture
20:43 πŸ”— xmc (and the pipeline)
20:43 πŸ”— xmc so you could call it archiveteam-ondemand_home.att.net_2016-06-12 or whatever
20:44 πŸ”— tsp_ There might be a few home.att.net rips in my collection, of different people. They're each in one .boe file.
20:45 πŸ”— xmc archivebot sometimes puts multiple crawl jobs in the same IA item, which isn't an issue
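Editor's sketch of one way to turn a start URL and grab date into an item identifier, following MrRadar's sitename_grabdate idea. It keeps the directory part of the path so that several home.att.net rips stay distinct, and collapses "/~" into a single underscore to avoid the double "_" tsp_ mentions. The function name and the assumption that identifiers should stick to letters, digits, ".", "_" and "-" are the editor's, not a rule confirmed in the chat.

```python
import re
from urllib.parse import urlsplit


def url_to_identifier(url, grab_date):
    """Build an identifier like host_directory_date from a start URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    # Drop the final segment unless the path ends in "/": it is treated
    # as the start page's file name, not part of the site's location.
    if segments and not parts.path.endswith("/"):
        segments = segments[:-1]
    raw = "/".join([host] + segments)
    # Collapse every run of other characters (such as "/~") into one "_".
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")
    return f"{slug}_{grab_date}"


if __name__ == "__main__":
    print(url_to_identifier(
        "http://home.att.net/~polliwog-press/pollistoryindex.htm",
        "2009-12-12",
    ))
```

For tsp_'s example this gives home.att.net_polliwog-press_2009-12-12. A URL whose last segment is a directory without a trailing slash would lose that segment, so such cases need a manual check.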
20:46 πŸ”— th1x has joined #archiveteam-bs
20:46 πŸ”— tsp_ I tried to put all my backups in one item, but the antivirus scanner found something and killed the entire item instead of the one file.
20:47 πŸ”— xmc huh
20:47 πŸ”— xmc email info@archive.org and tell them it's a false positive
20:47 πŸ”— tsp_ Maybe it's not. But I'd rather just have that one file excluded rather than the entire thing.
20:47 πŸ”— tsp_ I'll email, thanks.
20:52 πŸ”— tsp_ xmc: But as someone else said, it might be a better idea to upload all of them as individual items. But I'm finding that kind of difficult to work with atm.
21:05 πŸ”— schbirid has quit IRC (Quit: Leaving)
21:07 πŸ”— Lord_Nigh https://archive.org/about/faqs.php <- some of the content removal requests in the forum at the bottom of that page are weird... one page was about what looks like a foreign-language 'carfacts' equivalent site for a specific vehicle, which was not actually taken down...
21:07 πŸ”— Lord_Nigh guessing someone was trying to hide some prior damage or something
21:07 πŸ”— Lord_Nigh and the archive.org admins wisely decided to ignore the request
21:09 πŸ”— Lord_Nigh huh, this might be related to the robots.txt thing: https://archive.org/post/1074464/robotstxt-processing-failure although that might just be 'collateral damage' from the recent change and not a bug?
21:09 πŸ”— Lord_Nigh i.e. if www.example.com and example.com serve different robots.txt files, both will use the one from www.example.com
22:04 πŸ”— SHODAN_UI has quit IRC (Remote host closed the connection)
22:05 πŸ”— wp494_ is now known as wp494
22:14 πŸ”— Gilfoyle has quit IRC (Read error: Operation timed out)
22:54 πŸ”— BlueMaxim has joined #archiveteam-bs
23:13 πŸ”— nyany has quit IRC (Remote host closed the connection)
23:27 πŸ”— jrwr yipdw: you around by any chance
23:28 πŸ”— nyany has joined #archiveteam-bs
23:56 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
