#archiveteam-bs 2017-06-12,Mon


***j08nY has quit IRC (Quit: Leaving)
Ravenloft has joined #archiveteam-bs
[00:17]
...... (idle for 25mn)
tsp_Going through old stuff, I have about 200 Offline Explorer backup files containing sites that no longer exist on the live web (I haven't looked at wayback/reocities yet or tried to track them down if they moved). Anyone want these? [00:44]
joepie91tsp_: the answer is always yes :) you can just upload them to an item on archive.org
tsp_: ideally with as much information as you can remember about how you created them
[00:58]
jrwrif you need any help with it tsp_ let me know [01:08]
arkiverthe project for imzy is started [01:12]
jrwrthe warrior? [01:13]
arkiveryes
https://tracker.archiveteam.org/imzy/
FOS is the target
[01:13]
jrwrif it gets bogged down you can use jrwr.io [01:14]
arkiverthanks [01:14]
jrwrwheres the github [01:14]
arkiverwill keep that in mind
imzy-grab
[01:14]
jrwryou have free use of that server until the 6th of the month
then it expires
[01:14]
arkiversounds good [01:14]
jrwrwat
Resource Limit Is Reached
on the AT wiki
[01:14]
***wp494 has joined #archiveteam-bs [01:15]
jrwrSpinning up Virtual Machines [01:17]
arkivergreenie: we have started
with archiving imzy
[01:17]
jrwr20 VMs about to land [01:17]
arkiverwell, the site is a little slow
they'll mostly be limited by the tracker
[01:18]
jrwrjrwr turns off 15 of them [01:18]
MrRadarIs there a channel for Imzy? [01:18]
jrwrno
we can all hang out in here :p
[01:19]
MrRadarWhat's the recommended concurrent? [01:19]
jrwr6 [01:19]
MrRadarOK. I'll start there [01:19]
arkiverarkiver is afk
ping me if you see anything strange
[01:19]
***tfgbd_znc has quit IRC (Read error: Operation timed out) [01:20]
tsp_Thanks.
jrwr: Can I Upload them under one item? If so, what identifier do I use?
[01:32]
jrwropensource
web
and just use a name others can use to find it
[01:34]
tsp_I don't know how people will find it, it's about 200 individual files with different sites in them. [01:36]
jrwrare they named well?
the files
like what domain they came off of
[01:37]
tsp_Not really [01:37]
jrwrHrm
"Collection of archived websites"
and just upload them
[01:38]
tsp_tsp_ nods [01:39]
jrwrany details you can put in the desc helps [01:39]
tsp_I can extract a list of base URLs from most of them, there's a .dat inside the archives which can tell me where they came from. But it's more difficult to extract what settings I used to archive them. [01:39]
jrwrif you can prepare the list of domains/URLs and put them in the desc
they will be indexed so if someone goes looking for it
[01:40]
tsp_tsp_ nods [01:41]
jrwryou can start the upload now if you want
you can edit the desc after the fact
[01:41]
tsp_Ok. How do I start uploading, the ia command line utility? [01:42]
jrwrthere is [01:42]
tsp_Can I also rename and delete files after the fact? Some of these might not be dead. [01:42]
jrwrYou can
but I don't suggest it, since its always good to have some kind of backup in case
http://internetarchive.readthedocs.io/en/latest/cli.html
the web interface does work pretty well
[01:43]
tsp_I can also upload another 200 live ones ripped from the same time, I just separated out the dead ones a while ago. [01:45]
jrwrnever hurts [01:45]
tsp_I don't think the web interface lets me upload hundreds of files in bulk, though. [01:45]
jrwrno, it does not like that
the ia cli command does
[01:46]
tsp_I still want to keep live and dead separated, I'll focus on the dead ones since they're the most important then figure out what to do with the rest. [01:47]
jrwrYep
if you like saving sites, I do suggest warcproxy
https://github.com/internetarchive/warcprox
[01:47]
tsp_I forgot about that. I download fanfiction sites sometimes in bulk, will it work for, say, 10 GB of data next run? [01:48]
jrwrif you are browsing websites
that proxy saves it in a way that will flat out import into wayback machine
I wonder if there is an effort to save onion sites to IA
[01:48]
xmci don't know of one, but i would very much approve [01:51]
jrwrI would use a fork of archivebot
since it has all the ignore lists
[01:55]
tsp_I'll just call it MiscWebsiteRips or something. [02:04]
jrwr: I guess I want texts or data (probably data since these are archives) and not web, since they're not warc files? [02:18]
jrwrits still from the web
so its best if you put them under web and community texts
works the best really
[02:19]
tsp_Oh, so ia upload MiscWebsiteRips *.boe --metadata=mediatype:web
How would I get it under opensource?
[02:27]
Somebody2tsp_: it will default to being in the opensource collection, I think. [02:31]
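(A minimal sketch of the bulk-upload invocation being discussed. The identifier, the `*.boe` file pattern, and the metadata flags are the hypothetical ones from this conversation; this just builds the command string, it does not talk to archive.org:)

```python
import glob
import shlex

def build_ia_upload_command(identifier, pattern, metadata):
    """Construct an `ia upload` command line for a batch of files.

    `metadata` is a dict of key/value pairs, each emitted as a
    --metadata=key:value flag as discussed above.
    """
    parts = ["ia", "upload", identifier]
    parts.extend(sorted(glob.glob(pattern)))
    for key, value in metadata.items():
        parts.append("--metadata={}:{}".format(key, value))
    return " ".join(shlex.quote(p) for p in parts)

cmd = build_ia_upload_command(
    "misc_website_rips", "*.boe",
    {"mediatype": "web", "collection": "opensource"})
print(cmd)
```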
tsp_Is there a preference to identifier names? MiscWebsiteRips vs misc_website_rips vs misc-website-rips? [02:35]
***pizzaiolo has quit IRC (pizzaiolo) [02:37]
Somebody2tsp_: not between those choices, no. Probably the underscore version is safest.
as there are file systems without case-sensitivity, and shell tools that grumble at dashes.
I'd personally recommend uploading each of the sites into *separate* items, rather than one, though.
That way you can have separate metadata values for each site
but uploading *anything* is much better than not.
[02:38]
tsp_Somebody2: Can I split them out later? Right now they're kind of a mess, and I'd have to go through and figure out when things were downloaded, which ones have images, only partial sites, etc
Oh, there's an ia move command, so I can.
[02:41]
xmci think that only moves files within items
but i'm not sure
[02:42]
Somebody2You absolutely should upload them as one item if it would take work to separate them.
tsp_:
There's nothing at *all* wrong with immediately uploading a "raw" item with the pile-of-bits you currently have ...
[02:43]
xmcoh, huh, it might let you move files from one item to another! [02:44]
Somebody2... and then later uploading cleaned up, split up items with better metadata later.
If/when IA wants to de-duplicate the data they store, they can and will do so -- we uploaders should not worry about it
(unless you are getting up into the tens of terabytes range)
[02:45]
jrwrxmc: archivebot is "interesting" to setup [02:46]
xmcheh i bet [02:47]
tsp_The dead collection is only 2.6 GB, so not much. [02:47]
Somebody2tsp_: Yeah, at 3 GB, don't hesitate to upload a combined version now and split ones later. [02:48]
jrwrxmc: Ill have it run in #TorArchiveBot
since I don't want to rewrite the commands so it can exist in #ArchiveBot
[02:53]
xmcsounds good to me! [02:53]
Somebody2jrwr: wait, why are you running a copy of archivebot? [02:55]
jrwrI want to make a Tor version of it
to archive tor websites and maybe other darknets
[02:55]
MrRadararkiver: I'm getting an infinite sequence of HTTP 206 responses for a URL for Imzy: 87=206 https://www.imzy.com/api/accounts/profiles/mcnulty?check=true [03:06]
.... (idle for 16mn)
Somebody2jrwr: cool. I'm a bit doubtful about how many darknet websites there are that are sufficiently public that I'd support archiving them into a public repository like archive.org
but it's good to have available
[03:22]
xmcidk, archive everything on .onion that you can touch [03:30]
Somebody2xmc: presumably not the entirety of facebook... :-) [03:35]
***gui7 has quit IRC (Read error: Operation timed out) [03:37]
xmcwell, sure [03:38]
..... (idle for 23mn)
FroggingThis is off-topic, but does anyone know what to do about an insect that is crawling around inside an LCD
Like behind the panel
I've turned it off hoping it'll leave, dunno what else can be done. Probably nothing short of taking it apart :/
[04:01]
........... (idle for 50mn)
***Sk1d has quit IRC (Ping timeout: 250 seconds) [04:53]
Yurume has quit IRC (Remote host closed the connection)
Sk1d has joined #archiveteam-bs
[04:58]
Yurume has joined #archiveteam-bs [05:07]
Lord_NighSomebody2: I'm hearing comments which imply "something" happened regarding the robots.txt thing i mentioned earlier, and none of the IA staff are allowed to talk about it [05:16]
.... (idle for 16mn)
xmcLord_Nigh: what kind of something [05:32]
Lord_Nighnot sure. just comments from ia staff on twitter saying they can't talk about it, when asked flat out
the faq was never updated to note that retroactive blocking is once again possible (and in fact expanded) either
[05:32]
xmcthe reveal of this has been completely bungled and i'm not very pleased [05:34]
Lord_Nighsee https://twitter.com/TheMogMiner/status/873950228994502658
themogminer got it wrong about the faq being revised; the faq was NEVER revised (or it wasn't 2 weeks ago, i didn't look since) and still implies that retroactive blocking isn't possible
[05:35]
the faqs page looks like it was updated, but is even less clear about removal and retroactive stuff than before
no, i'm wrong
its the same from 2 years ago:
"How can I have my site's pages excluded from the Wayback Machine?
You can send an email request for us to review to info@archive.org with the URL (web address) in the text of your message. "
before 2 years ago it was:
"How can I have my site's pages excluded from the Wayback Machine?
You can exclude your site from display in the Wayback Machine by placing a simple robots.txt file on your Web server.
Here are directions on how to automatically exclude your site. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.
If you are emailing to ask that your website not be archived, please note that you'll need to include the url (web address) in the text of your message. "
the "directions on how to automatically exclude your site" was a link to https://web.archive.org/web/20130606003203/http://archive.org/about/exclude.php
faq from 6/2013 is https://web.archive.org/web/20130606003203/http://archive.org/about/faqs.php
so the faq was never updated to restore the old verbiage, and the exclude.php is a 404 currently
[05:45]
FroggingI thought they were making the robots.txt handling more permissive, not less
at least that's what I got from people talking about it in here
[05:49]
Lord_Nighthe info on exclude.php in the archive isn't valid anymore either, since archive.org respects user agent of "*" now, it doesn't explicitly look for "ia_archiver" like it did 2 or 3 years ago [05:50]
Frogging(permissive meaning blocking fewer sites) [05:50]
Lord_NighFrogging: exactly! this is really mysterious
for 2 or 3 years it only respected the robots.txt that was present when the site was archived, and did not allow 'retroactive' blocking at all
i.e. if it archived on june 2012 and there was a robots.txt on that day blocking "ia_archiver" it was blocked in the archive
but if later or earlier the robots.txt didn't exist or didnt block "ia_archiver" it would happily let the site display
the behavior now is if robots.txt blocks any sort of robot by "*" user agent (and probably also "ia_archiver" though I have not tested that), and that file is currently live, it will block EVERY ARCHIVED VERSION OF THAT SITE, EVER
this is a major problem with malicious squatters, who have gone right back to abusing the system like they did 3 years ago
so this is why i am majorly concerned, especially since the https://twitter.com/TheMogMiner/status/873950228994502658 text implies there may be a legal/court order preventing IA staff from talking about it, let alone fixing it
[05:50]
xmcyes
i'm going to bed
[05:54]
Lord_Nighand since i haven't heard any noise on slashdot or ycombinator about this, i assume it is a sealed court order, or even possibly an NSL
which is even creepier
[05:55]
yipdwcould even be Majestic 12 influence
🤔
[05:56]
pikhqI'd be inferring something to do with legal system bustedness, that's for sure. [05:56]
yipdwlol that tweet [05:58]
Froggingreally doubt it's an NSL, I'm sure there are channels for that and they do not involve robots.txt [05:58]
yipdwinferring conspiracies from "not allowed to comment"
Warren Spector missed the mark on Deus Ex
[05:58]
pikhqYeah, I'm loving that they're inferring *anything* more involved than "something something legal system".
(hell, it might not even be a sealed court order, just a court order we're ignorant of, and "not allowed to comment" is more "IA's lawyers would hate me for saying anything")
(or a settlement, or, well. Anything, really. Not much to go off of.)
[05:59]
Lord_Nighsealed settlement would suck since that means it can never change again
I'm probably reading way too much into those comments
[06:01]
pikhqI'd not be confident reading much more into it than "there is probably a lawyer involved in this". [06:02]
.... (idle for 19mn)
tsp_I started uploading misc_website_rips with ia, but it stopped here: error uploading The Fanfic Vault.boe: We encountered an internal error. Please try again. - uploadItem.py: uploading id = misc_website_rips using contribSubmit!
I didn't see anything in the docs about what to do next, most importantly, will it resume if I run the same command again with the list of previously-uploaded files?
[06:21]
........ (idle for 37mn)
***SHODAN_UI has joined #archiveteam-bs [06:59]
schbirid has joined #archiveteam-bs [07:04]
.......... (idle for 46mn)
j08nY has joined #archiveteam-bs [07:50]
..... (idle for 21mn)
BlueMaxim has quit IRC (Quit: Leaving) [08:11]
............ (idle for 56mn)
SHODAN_UI has quit IRC (Remote host closed the connection) [09:07]
........... (idle for 53mn)
pizzaiolo has joined #archiveteam-bs [10:00]
........ (idle for 39mn)
j08nY has quit IRC (Read error: Operation timed out) [10:39]
......... (idle for 43mn)
SHODAN_UI has joined #archiveteam-bs [11:22]
......... (idle for 40mn)
j08nY has joined #archiveteam-bs [12:02]
.......... (idle for 49mn)
dcmorton has quit IRC (Quit: ZNC - http://znc.in) [12:51]
dcmorton has joined #archiveteam-bs [12:56]
.......... (idle for 49mn)
Odd0002 has quit IRC (Remote host closed the connection) [13:45]
.................... (idle for 1h35mn)
jrwrtsp_: correct it will resume
"If you think the Archive is a bad actor you are 12 pineapples short of a luau"
Word-Smithing at its finest
[15:20]
***BartoCH has quit IRC (Ping timeout: 260 seconds) [15:35]
BartoCH has joined #archiveteam-bs [15:46]
pizzaiolo has quit IRC (Quit: pizzaiolo) [15:52]
.... (idle for 19mn)
Pudsey has joined #archiveteam-bs [16:11]
.... (idle for 16mn)
bsmith093tapedrive: if i could make a suggestion, archive.org/details/fanfictiondotnet_repack much better version than one giant tar file. that was my bad. [16:27]
tapedriveI've actually downloaded that (the zip of the entire item) a few weeks ago, and I'll be integrating that into my archive and database - along with the 2012 scrapes - which will allow us to see how many stories we're actually missing overall.
And also thanks for doing that scrape, bsmith093, it's been a lifesaver to me many times
[16:30]
***Pudsey has quit IRC (Remote host closed the connection)
pizzaiolo has joined #archiveteam-bs
[16:39]
tuluu has joined #archiveteam-bs [16:48]
tuluu has quit IRC (Ping timeout: 260 seconds)
tuluu has joined #archiveteam-bs
[16:59]
mls_ has quit IRC (Read error: Connection reset by peer) [17:13]
godanehttps://archive.org/details/godaneinbox?sort=-publicdate&&and[]=subject%3A%22Sports%20Illustrated%22 [17:14]
***mls_ has joined #archiveteam-bs [17:14]
JAAbsmith093: Although you're right in that several smaller files are easier to handle, you can browse the contents of tar and zip archives on the IA and only download individual files from it. For example, https://archive.org/download/fanfictiondotnet_repack/Fanfiction_Q.zip/ (Don't attempt to do this with the larger zips though; my browser was not amused.) [17:25]
***Smiley has joined #archiveteam-bs [17:31]
MrRadararkiver: It looks like the Imzy project doesn't work for the Warrior VM since the www.imzy.com domain requires TLS 1.2 [17:34]
***Simpbrain has quit IRC (Ping timeout: 506 seconds) [17:35]
arkiverah... [17:37]
***pizzaiolo has quit IRC (Quit: pizzaiolo) [17:40]
...... (idle for 28mn)
icedice has joined #archiveteam-bs [18:08]
tsp_jrwr: My item is unavailable due to content issues, not sure why. It looks like i'll have to email IA to find out why, maybe it didn't like one of the files. [18:09]
***SHODAN_UI has quit IRC (Remote host closed the connection) [18:11]
PurpleSymtsp_: You can replace /details/ with /history/ in the item’s URL. Usually there is a reason in the logs. [18:23]
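(The URL trick PurpleSym describes, as a one-line sketch:)

```python
def history_url(details_url):
    """Swap /details/ for /history/ to reach an item's task log."""
    return details_url.replace("/details/", "/history/", 1)

print(history_url("https://archive.org/details/misc_website_rips"))
```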
tsp_PurpleSym: I'm there (http://archive.org/history/misc_web_rips). Doesn't seem like there are any tasks, though I can't tell if "server readonly -- tasks waiting for harddrive fix" is a task or not. [18:29]
PurpleSymThat item does not exist.
You probably meant misc_website_rips: https://catalogd.archive.org/log/682094076
[18:30]
tsp_Sorry, yeah, just noticed that. [18:32]
Antivirus got it. Is there anything I can do about it now that I know what the problem is?
PurpleSym: Thanks, didn't know about history.
[18:37]
MrRadarYour best bet is probably to create a new item without the offending data [18:37]
tsp_MrRadar: With the understanding that any new sites I add might disable the new item due to malware? [18:42]
MrRadarYou might want to switch to 1 item per site then [18:42]
tsp_The biggest issue with that is coming up with identifiers. [18:44]
MrRadarsitename_grabdate ? [18:45]
***tuluu has quit IRC (Remote host closed the connection)
tuluu has joined #archiveteam-bs
[18:45]
tsp_That might work, if i can get grabdates out of these files. [18:47]
.... (idle for 15mn)
JAAarkiver: We may want to reduce the Imzy rate limit a bit. I see tons of timeouts.
And 504s
[19:02]
..... (idle for 20mn)
Better now [19:22]
bsmith093tapedrive: i also just uploaded this one recently https://archive.org/details/Fanfictiondotnet1011dump
roughly 750k more stories, with a metadata db using the same schema
[19:30]
tapedrivebsmith093: I'm extracting that one now ;) (taking forever on a 256mb Raspberry Pi XD )
Thanks for your great work!
[19:30]
bsmith093im also doing fictionpress and ao3 in that format, but as a simple scrape like the others. Redundancy is good, though, and i only started this because it was really easy to give fanficfare a giant list. [19:32]
tapedriveI'm doing it in the same format as you to make it easy to merge them (which is what I'm doing now) and work out how many are actually missing [19:33]
bsmith093tapedrive: to save some time, 130k of the first million stories on ffnet are all thats left of the oldest group. [19:34]
tapedriveYeah - I have a mysql database of all of them now
Saying if they've been archived, deleted, or still need to be archived
And a script that runs looking at fanfiction.net for new stories and updates, and updates the database accordingly.
Scrapers query the database and get new stories to archive/rearchive
[19:34]
tsp_I have a fanfiction.net db I've been collecting for a while, kind of an oddball format but it can be parsed. If someone can generate a list of story ids that are missing, I can go through and check which ones I have. [19:42]
tapedriveIn a few days (hopefully) I'll have a list of all the ones that I don't have (which will include the 2012 scrape and bsmith093's multiple scrapes) [19:43]
nightpoolwhat's going on with fanfiction.net? [19:44]
tapedriveNothing, yet.
But as far as I can find out the site's run by either a single person or a small team, that never reply to emails, tweets or comments.
And every month the site has different problems, which sometimes get fixed, and more often don't.
[19:44]
tsp_I mostly collect Harry Potter, but a few others got in as well. [19:46]
nightpooloh, yeah, I mean, it's good that it's being archived.
I was just wondering if there was a proximate cause to discussion
[19:47]
tapedriveIt's only come up because my crawl has finished recently. Nothing's happened with the site [19:48]
***medowar has joined #archiveteam-bs [19:50]
wp494_ has joined #archiveteam-bs
luckcolor has quit IRC (Read error: Operation timed out)
dboard has quit IRC (Read error: Operation timed out)
luckcolor has joined #archiveteam-bs
Petri152 has quit IRC (Read error: Operation timed out)
dboard has joined #archiveteam-bs
JAA has quit IRC (Read error: Operation timed out)
Lord_Nigh has quit IRC (Read error: Operation timed out)
ndiddy has quit IRC (Read error: Operation timed out)
K4k has quit IRC (Read error: Operation timed out)
mundus20- has quit IRC (Write error: Broken pipe)
aschmitz has quit IRC (Read error: Operation timed out)
mhazinsk has quit IRC (Read error: Operation timed out)
jrwr has quit IRC (Read error: Operation timed out)
will has quit IRC (Read error: Operation timed out)
C4K3 has quit IRC (Read error: Operation timed out)
ZexaronS has quit IRC (Read error: Operation timed out)
Stilett0 has quit IRC (Read error: Operation timed out)
Stilett0 has joined #archiveteam-bs
K4k has joined #archiveteam-bs
PotcFdk has quit IRC (Read error: Operation timed out)
mundus201 has joined #archiveteam-bs
rocode has quit IRC (Read error: Operation timed out)
rocode has joined #archiveteam-bs
jrwr has joined #archiveteam-bs
ZexaronS has joined #archiveteam-bs
ndiddy has joined #archiveteam-bs
mhazinsk has joined #archiveteam-bs
Lord_Nigh has joined #archiveteam-bs
will has joined #archiveteam-bs
wp494 has quit IRC (Read error: Operation timed out)
PotcFdk has joined #archiveteam-bs
JAA has joined #archiveteam-bs
Petri152 has joined #archiveteam-bs
C4K3 has joined #archiveteam-bs
[19:55]
SHODAN_UI has joined #archiveteam-bs [20:14]
...... (idle for 25mn)
tsp_How should I turn a URL into an identifier? I can tell which page I started from, for example: http://home.att.net/~polliwog-press/pollistoryindex.htm
If I do some manual editing, I can get it to home.att.net__polliwog-press_2009-12-12, however I'm not sure if I should turn ~ into _ or remove it altogether to avoid the double _.
[20:39]
xmcarchivebot turns it into an item named after just the hostname and the date of capture
(and the pipeline)
so you could call it archiveteam-ondemand_home.att.net_2016-06-12 or whatever
[20:43]
tsp_There might be a few home.att.net rips in my collection, of different people. They're each in one .boe file. [20:44]
xmcarchivebot sometimes puts multiple crawl jobs in the same IA item, which isn't an issue [20:45]
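(A sketch of the naming scheme being worked out above, using tsp_'s home.att.net example. The sanitisation rules here are illustrative, not an IA requirement; non-identifier characters such as `/~` collapse to a single underscore, which sidesteps the double-underscore question:)

```python
import re
from urllib.parse import urlparse

def make_identifier(start_url, grab_date):
    """Build a sitename_grabdate-style item identifier from a start URL.

    Anything outside [a-z0-9._-] collapses to one underscore, so
    home.att.net/~polliwog-press -> home.att.net_polliwog-press.
    """
    parsed = urlparse(start_url)
    path = parsed.path.rsplit("/", 1)[0]  # drop the start page itself
    raw = (parsed.netloc + path).lower()
    sanitized = re.sub(r"[^a-z0-9._-]+", "_", raw).strip("_")
    return "{}_{}".format(sanitized, grab_date)

print(make_identifier(
    "http://home.att.net/~polliwog-press/pollistoryindex.htm",
    "2009-12-12"))
```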
***th1x has joined #archiveteam-bs [20:46]
tsp_I tried to put all my backups in one item, but the antivirus scanner found something and killed the entire item instead of the one file. [20:46]
xmchuh
email info@archive.org and tell them it's a false positive
[20:47]
tsp_Maybe it's not. But I'd rather just have that one file excluded rather than the entire thing.
I'll email, thanks.
[20:47]
xmc: But as someone else said, it might be a better idea to upload all of them as individual items. But I'm finding that kind of difficult to work with atm. [20:52]
***schbirid has quit IRC (Quit: Leaving) [21:05]
Lord_Nighhttps://archive.org/about/faqs.php <- some of the content removal requests in the forum at the bottom of that page are weird... one page was about what looks like a foreign-language 'carfacts' equivalent site for a specific vehicle, which was not actually taken down...
guessing someone was trying to hide some prior damage or something
and the archive.org admins wisely decided to ignore the request
huh, this might be related to the robots.txt thing: https://archive.org/post/1074464/robotstxt-processing-failure although that might just be 'collateral damage' from the recent change and not a bug?
i.e. if www.example.com and example.com serve different robots.txt files, both will use the one from www.example.com
[21:07]
............ (idle for 55mn)
***SHODAN_UI has quit IRC (Remote host closed the connection)
wp494_ is now known as wp494
[22:04]
Gilfoyle has quit IRC (Read error: Operation timed out) [22:14]
......... (idle for 40mn)
BlueMaxim has joined #archiveteam-bs [22:54]
.... (idle for 19mn)
nyany has quit IRC (Remote host closed the connection) [23:13]
jrwryipdw: you around by any chance [23:27]
***nyany has joined #archiveteam-bs [23:28]
...... (idle for 28mn)
dashcloud has quit IRC (Read error: Operation timed out) [23:56]
