Time |
Nickname |
Message |
00:04
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |
01:24
|
|
tech234a has joined #urlteam |
01:54
|
Somebody2 |
Thanks, VADemon. Let me know if you have questions, tech234a |
02:45
|
Flashfire |
VADemon you said you were a native Russian speaker |
02:45
|
Flashfire |
yes? |
02:46
|
VADemon |
correct |
02:49
|
Flashfire |
my problem is fitting http://ΡΠΎΡ.ΡΡ/ and http://ΡΡΡ.ΡΡ/ into the alive section that is found here https://www.archiveteam.org/index.php?title=URLTeam#Alive |
02:49
|
Flashfire |
However, it needs to be in alphabetical order |
02:49
|
Flashfire |
This is where I run into my problem VADemon |
02:49
|
Flashfire |
I dont care if the only info you put in for the comments section is that they are alive |
02:50
|
Flashfire |
I just want them put in alphabetically |
02:50
|
Flashfire |
and dont even get me started on fucking emoji URLs |
02:50
|
Flashfire |
WHOSE IDEA WAS THAT |
02:51
|
VADemon |
keep cyrillic alphabet after the latin, and ΡΡΡ.ΡΡ comes first then ΡΠΎΡ.ΡΡ |
02:52
|
Flashfire |
Ok then |
02:52
|
Flashfire |
thanks |
02:52
|
Flashfire |
I will get to that after I finish sorting a few video lists out unless Somebody2 you want to get that down |
02:52
|
VADemon |
btw mediawiki should allow you to sort the column without first clicking on it |
02:53
|
Flashfire |
VADemon if you could also find out which Cyrillic Script those URL shorteners use and come back to us with a list to add that to the tracker I will buy you a game on steam that is under $5 I would be so happy |
02:53
|
VADemon |
BUT dont ask me how any of MWiki works. its worse than ancient glyphs |
02:54
|
Flashfire |
Thats fine lol |
03:18
|
|
Zerote has quit IRC (Ping timeout: 260 seconds) |
03:46
|
VADemon |
ΡΠΎΡ.ΡΡ: 0-9а-яА-Я 3-chars sequential (seems to be sequential, but the number they claim, 88k, doesn't add up) |
03:46
|
VADemon |
Flashfire: not found is 302 to /404.html, found is 302 to link, link deduplication, HEAD requests work |
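The behaviour VADemon reports for this shortener (3-character codes, "not found" as a 302 to /404.html, "found" as a 302 to the target) can be sketched roughly as below. This is only an illustration, not URLTeam tracker code: it assumes the alphabet is the digits plus the Cyrillic ranges а-я and А-Я, and the names `ALPHABET`, `shortcodes` and `classify` are made up for the example. It makes no network requests.

```python
from itertools import product
from urllib.parse import urlsplit

# Assumed alphabet: 0-9, then а-я, then А-Я (ё/Ё fall outside these ranges).
DIGITS = "0123456789"
LOWER = "".join(chr(c) for c in range(ord("а"), ord("я") + 1))
UPPER = "".join(chr(c) for c in range(ord("А"), ord("Я") + 1))
ALPHABET = DIGITS + LOWER + UPPER


def shortcodes(length=3):
    """Yield every candidate shortcode of the given length, in alphabet order."""
    for chars in product(ALPHABET, repeat=length):
        yield "".join(chars)


def classify(status, location):
    """Classify one response using the rules described in the log.

    302 to /404.html -> "not_found"; 302 to anything else -> "found".
    Anything else is "unknown". A real target whose path happens to be
    /404.html would be misclassified; this sketch ignores that case.
    """
    if status != 302 or not location:
        return "unknown"
    if urlsplit(location).path == "/404.html":
        return "not_found"
    return "found"
```

Under that assumed alphabet there are 74 characters, so 74 ** 3 three-character candidates in total.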
03:47
|
Flashfire |
Ok I will add that to the wiki when/if I get a chance I have other stuff on my plate first |
03:51
|
VADemon |
Very cool of ΡΡΡ.ΡΡ, gotta be an oldfag - the descriptions are lovely. And they also tell what they use |
03:55
|
Flashfire |
Which is? I dont read russian |
03:57
|
VADemon |
ΡΡΡ.ΡΡ: 0-9а-яА-Я-_ link deduplication, case sensitive, 200 on not found, 200 on found - redirect via JS location. links are random, user-specified links are min=9 in length |
03:59
|
VADemon |
example generated: http://ΡΡΡ.ΡΡ/Π£ΡΠ«/ ex. user: http://ΡΡΡ.ΡΡ/ΠΊΠ»Π°ΡΡΠ½ΡΠΉ-ΡΠΎΡΡΠ΅Π½Ρ-ΠΊΠ»ΠΈΠ΅Π½Ρ/ |
04:35
|
|
Hani111 has joined #urlteam |
04:44
|
|
Hani has quit IRC (Ping timeout: 615 seconds) |
04:44
|
|
Hani111 is now known as Hani |
05:22
|
|
warmwaffl has quit IRC (Remote host closed the connection) |
05:31
|
|
cats_ has joined #urlteam |
05:32
|
cats_ |
who decides what gets added to the urlteam scrape list |
05:32
|
cats_ |
and who also decided that what could essentially be considered an attack by blasting random incrementing strings at a server was ever a good idea |
05:37
|
Flashfire |
Sorry whats wrong? |
05:37
|
Flashfire |
cats_ do you have a problem you wish to address? |
05:37
|
cats_ |
I own catbox.moe, which you're "scraping" the now defunct qt.catbox.moe |
05:38
|
Flashfire |
Somebody2 |
05:38
|
Flashfire |
cats_ I can pause the tracker but we are doing it to try and preserve info. Is there any way of getting a database dump of the urls then cats_ |
05:39
|
Flashfire |
Pause the tracker for catbox.moe |
05:40
|
Flashfire |
Arkiver Chfoo JAA Somebody2 Hcross someone with more authority needed pronto |
05:40
|
Flashfire |
cats_ I can pause the tracker on catbox for a little while but are you willing to negotiate? |
05:41
|
JAA |
Who decided that hosting a service that would cause thousands of links to die when (not if) it eventually goes down despite the actual target URLs still being perfectly fine was ever a good idea? |
05:41
|
cats_ |
I read your "appeal" page already and I've already had interactions with someone who was affiliated with ArchiveTeam before (wubthecaptain) |
05:42
|
JAA |
cats_: If we're causing issues for you, we can reduce the request rate. But the ideal solution would be if you could produce a database dump, then we don't have to scrape at all. |
05:42
|
cats_ |
i think you're misunderstanding the point here - these url's aren't yours to collect |
05:42
|
cats_ |
they contain potentially private information |
05:42
|
JAA |
Then they probably shouldn't be publicly accessible. |
05:43
|
cats_ |
okay then |
05:43
|
cats_ |
here's the deal |
05:44
|
cats_ |
either stop completely, or i'm just going to wipe the DB |
05:44
|
cats_ |
https://files.catbox.moe/mg974m.png |
05:44
|
Flashfire |
I can certainly slow it down for you |
05:45
|
cats_ |
as i said before, the shortener is a now defunct service. if you want the *publicly accessible links*, i recommend you use things which scrape the internet for those links, like... google? |
05:46
|
Flashfire |
Ok currently pausing the tracker |
05:46
|
cats_ |
and, as it says on the front page (even though I know you enjoy meme'ing about "when not if"), the links will stay accessible for the present future |
05:46
|
Flashfire |
How long is the present future lasting till? |
05:46
|
JAA |
I have no reason not to believe you on that, but shit happens. Bus factor etc. |
05:46
|
cats_ |
not to mention i wasn't contacted previously |
05:46
|
cats_ |
as in, at all |
05:47
|
cats_ |
(sans wubthecaptain, 3 years ago) |
05:47
|
Flashfire |
Are you willing to give a database dump then cats_ If we promise not to scrape? |
05:48
|
cats_ |
maybe you should be more concerned about other things, like the *other* file host that just shut down with over 4 million files |
05:48
|
cats_ |
although i know that's not this channel's forte |
05:48
|
JAA |
Mixtape.moe? We archived what we could of that. |
05:49
|
JAA |
But yes, not part of this particular project. We're focusing on URL shorteners here. |
05:49
|
cats_ |
tell you what, since you've been at least cordial with me (except JAA), i'll make a deal |
05:50
|
cats_ |
i'll go through my access logs for the past... 3 years, and i'll give you every link that has a referrer (meaning it was linked somewhere on the internet) |
05:50
|
|
icedice has joined #urlteam |
05:51
|
cats_ |
alternatively, I'll give you every anonymous shortened URL |
05:51
|
Flashfire |
Thats the entire length of the services life? |
05:51
|
cats_ |
i urge you to take the second option. |
05:52
|
Flashfire |
No chance of both options? Not that I am in a position to bargain but out of curiosity? If I cant have both options I think the second option sounds best |
05:52
|
cats_ |
catbox just turned 4... link shortener became active at the end of 2016, so about |
05:53
|
cats_ |
https://files.catbox.moe/4rlg87.png |
05:53
|
Flashfire |
So if we can take both of those options together that would be the best but if thats not possible then just the second option |
05:53
|
cats_ |
9300 is better than nothing |
05:53
|
cats_ |
(anon url's) |
05:54
|
cats_ |
9500* |
05:54
|
Flashfire |
Combined with what we already have I think that if we cant take both of your options together then the anon urls are the option we will take |
05:55
|
Flashfire |
Where would we be able to pick that information up from? |
05:55
|
cats_ |
you can't have gotten very much if you're only at 2m*** |
05:55
|
cats_ |
also, you were scraping 5 characters, and the shorts are 6 |
05:57
|
cats_ |
give me a couple minutes to confer |
05:57
|
Flashfire |
alright then |
05:59
|
|
jornane has joined #urlteam |
06:00
|
|
jornbaer has quit IRC (Quit: Vedlikehold) |
06:00
|
|
agris has joined #urlteam |
06:14
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |
06:58
|
cats_ |
sorry, i had to step away for a little bit longer than expected |
06:58
|
Flashfire |
All Good |
06:59
|
cats_ |
i've decided that for right now, dumping the anonymous URLs is the best option |
07:00
|
cats_ |
compiling the referrer links will take time |
07:00
|
Flashfire |
Alright |
07:00
|
Flashfire |
Where will these be dumped? |
07:00
|
cats_ |
and of course, we're *both* concerned with time |
07:00
|
Flashfire |
Yes |
07:00
|
cats_ |
do you want it in an .sql file or |
07:00
|
cats_ |
json |
07:01
|
cats_ |
csv |
07:01
|
Flashfire |
Would I be able to get one of each or wait until one of the other operators comes back? |
07:02
|
cats_ |
i'm fine with waiting for another op |
07:02
|
Flashfire |
Somebody2 JAA chfoo HCross arkiver little help here again please |
07:02
|
Flashfire |
thank you |
07:09
|
JAA |
cats_: The file format doesn't matter very much as long as it's clear what it contains. I guess it'll just be shortcode + URL, or do you keep any further relevant information? (One service that gave me their file had some spam check flag, for example.) Personally, I'd prefer JSON or CSV over SQL. |
07:12
|
JAA |
We might want to run these short URLs through ArchiveBot if that's ok with you to get both the redirect and the target page into the Wayback Machine. I see that we sent 2.4 million requests your way already, so I guess the ~9k won't matter too much. It'll also be a much slower rate I think, and we can slow it down further if needed. |
07:12
|
cats_ |
there's a date and potentially malicious bool |
07:12
|
cats_ |
~90% of the potentially malicious flagged ones are other shorteners though |
07:13
|
JAA |
I see, would be great if you could include those fields as well. |
07:13
|
cats_ |
i had people layering shorteners to disguise ip grabbers |
07:13
|
cats_ |
sometimes 5-6 levels |
07:14
|
cats_ |
also the rate wasn't a problem necessarily (nginx was panicking a bit) |
07:14
|
cats_ |
it's just getting thousands of requests that are consistently 404 and seemingly programmatic is... suspicious to say the least |
07:14
|
cats_ |
malicious, to say the worst - in addition to not being contacted beforehand |
07:16
|
cats_ |
do you want the incremental ID #'s as well or should i just drop that column |
07:23
|
Flashfire |
cats_ if we slow it down a lot and now that you know it's us could we resume the grab at a severely reduced rate? Or is that asking too much? Forgive me for not having as much experience in negotiation |
07:23
|
cats_ |
there's some phrase that goes "give an inch, they take a mile" that applies here |
07:24
|
JAA |
Yeah, understandable. (Sorry for the slow responses by the way. I'm on a train and the connection is... not so great.) |
07:24
|
Flashfire |
Ignore me in that case cats_ |
07:24
|
cats_ |
i'm not sure what those 3863 urls are worth to you |
07:25
|
cats_ |
compared to the 9466 anon |
07:25
|
JAA |
Incremental ID sounds potentially useful. |
07:25
|
JAA |
3863? |
07:26
|
Flashfire |
The ones we wont get JAA |
07:26
|
cats_ |
13329 (total) - 9466 (anon) = 3863 (associated with a user) |
07:26
|
JAA |
Ah |
07:28
|
Flashfire |
cats_ they arent worth losing out on our deal but if possible we would like them as well. Understandable that you dont want to give them though |
07:31
|
cats_ |
well, you have 797,448,960 possible urls to scan |
07:32
|
JAA |
The most interesting ones of those would be the ones linked somewhere on the web. So I still like the referrer idea. Would be interesting also how many are remaining afterwards. |
07:44
|
cats_ |
ah whatever |
07:45
|
cats_ |
i was giving you guys a hard time earlier because i was just upset but i don't really care at this point |
07:45
|
cats_ |
give me a minute to anonymize the table |
07:45
|
Flashfire |
Our apologies cats_ we only had good intentions didnt mean to upset you at all friend |
07:46
|
Flashfire |
!a https://www.youtube.com/channel/UCHai12P6Gh7PaIYZGnzyrSA |
07:46
|
Flashfire |
Ignore that |
08:03
|
cats_ |
https://qt.catbox.moe/shortenedurls.json |
08:04
|
Flashfire |
JAA can you download that I dont have enough space on my computer for extra files |
08:04
|
Flashfire |
I have like 2GB left |
08:04
|
Flashfire |
If that and dont want to accidentally delete this |
08:05
|
cats_ |
>2gb |
08:05
|
cats_ |
wh |
08:05
|
Flashfire |
Because the laptop I am using at the moment is a 2011 Macbook Pro which I am using until it dies and I have to upgrade |
08:06
|
cats_ |
that's... impressive to say the least |
08:06
|
Flashfire |
What that is still running? |
08:06
|
cats_ |
everything about it |
08:07
|
cats_ |
anyway, that's everything |
08:07
|
Flashfire |
Everything is original except for the hard drive in it which is a 250GB SSD |
08:07
|
cats_ |
would you be upset if i deleted the service now? |
08:07
|
cats_ |
;o |
08:08
|
JAA |
Thanks a lot! |
08:08
|
|
legoktm has joined #urlteam |
08:08
|
Flashfire |
I mean I am upset every time I hear about data loss but like it happens. Just give us some time to download the file and sort |
08:08
|
cats_ |
the link shortener was nothing but a pain anyway. 80% malicious 20% real |
08:08
|
cats_ |
at least with a file host you can weed out the shit |
08:09
|
JAA |
Only a little since I'd like to throw it into ArchiveBot first so the redirects will work in the Wayback Machine. But at least the list won't be lost. :-) |
08:09
|
cats_ |
i was kitten. i'm still going to keep the service up |
08:09
|
JAA |
I see a lot of files.catbox.moe URLs in there. As if those URLs weren't short enough already. |
08:10
|
cats_ |
some people don't like extensions |
08:10
|
Flashfire |
Cheers. (does this mean I can turn on the tracker again so people can stop complaining about no work units ;P) |
08:10
|
cats_ |
\/shrug |
08:10
|
JAA |
Flashfire: Please do, but add another shortener instead. :-) |
08:10
|
cats_ |
i'm sure the work units were being done incredibly fast since i was 444'ing them for the 30 minutes before i came here |
08:10
|
JAA |
Plenty of other work to do. |
08:11
|
Flashfire |
I will look at what else I can put in. (People will complain regardless) |
08:12
|
cats_ |
If you need anything else, you know where to get me https://catbox.moe/contact.php |
08:12
|
Flashfire |
I would really like to see Linktree done but I dont have the knowledge to put that one through |
08:12
|
Flashfire |
Thanks very much |
08:12
|
cats_ |
and for the future, you guys should knock before you start busting doors ;v |
08:13
|
cats_ |
peace peace |
08:16
|
Flashfire |
JAA when are we starting on the google short urls? |
08:21
|
JAA |
Flashfire: Right. I was thinking it would be nice to switch to WARC output first. But on the other hand, it might be years before that happens. |
08:22
|
Flashfire |
I mean I don't have the coding knowledge required and am too busy to pick it up |
08:22
|
JAA |
I think we could easily do it with a separate warrior project, but I'm not sure whether we want to do that. |
08:23
|
Flashfire |
After we assfucked google for google plus I think we should avoid another warrior project against google at least for a month |
08:23
|
JAA |
There's only one default project, so we can't split up the workers between URLTeam and such a separate effort. |
08:24
|
JAA |
Or we should continue immediately since they're used to it already. :-) |
08:24
|
Flashfire |
lol |
08:25
|
Flashfire |
Its not worth adding bitly aliases to the tracker is it? |
08:29
|
JAA |
Nope. We can be happy if we ever get close to complete bit.ly coverage. |
08:30
|
Flashfire |
Alright |
08:31
|
|
cats_ has quit IRC (Quit: Page closed) |
08:31
|
|
MR9K has joined #urlteam |
08:32
|
|
cats_ has joined #urlteam |
08:32
|
cats_ |
oops, i see you readded catbox |
08:32
|
cats_ |
i'm blocking your project, give me a second |
08:32
|
cats_ |
i was* |
08:32
|
Flashfire |
oh sorry |
08:32
|
cats_ |
you're just getting 444's for everything |
08:32
|
JAA |
Did we? That was not intended. |
08:33
|
Flashfire |
So ill stop it again? |
08:33
|
JAA |
444 is just connection closures, right? |
08:33
|
Flashfire |
I got confused |
08:33
|
JAA |
Yes |
08:33
|
JAA |
Sorry, I wasn't clear above. |
08:33
|
cats_ |
that you are |
08:33
|
cats_ |
https://files.catbox.moe/izw8c2.png |
08:35
|
Flashfire |
Thats ok I disabled it again |
08:36
|
cats_ |
are you going to scrape just the ones i gave you? |
08:36
|
Flashfire |
No idea at this point |
08:38
|
cats_ |
if you just need your slaves to work out the last couple work units you can turn it back on then |
08:38
|
cats_ |
they'll still 444 |
08:38
|
JAA |
Yeah, I'll run the ones you gave me through ArchiveBot. We won't be grabbing anything further from catbox.moe through this project. |
08:40
|
|
Smiley has quit IRC (Read error: Operation timed out) |
08:40
|
Flashfire |
https://ffm.to is a shortener I just found through instagram |
08:41
|
Flashfire |
Adding it to my dump for now will integrate into main list later |
08:45
|
cats_ |
as long as archivebot doesn't have the same useragent |
08:47
|
|
Smiley has joined #urlteam |
08:59
|
JAA |
It doesn't. I don't remember the full one, but it starts with "ArchiveTeam ArchiveBot/". |
09:13
|
|
justas has joined #urlteam |
10:22
|
|
cats_ has quit IRC (Quit: Page closed) |
10:24
|
|
SilSte has joined #urlteam |
10:34
|
|
Zerote has joined #urlteam |
12:51
|
|
justas is now known as jut |
14:22
|
|
mtntmnky_ has joined #urlteam |
14:24
|
|
mtntmnky has quit IRC (Remote host closed the connection) |
15:44
|
|
tech234a has joined #urlteam |
16:19
|
marked |
what's the advantage if urlteam switched to .warc format? |
17:27
|
Somebody2 |
Sigh -- this is why I wanted to try and *find out* if there were any 4 or 5 character shortcodes. |
17:28
|
Somebody2 |
Also, why it's better to actually check if the owners of shortening services are accessible first. |
17:28
|
Somebody2 |
In any case, there's no need to run the job further. |
17:32
|
Somebody2 |
and clearing the invalid ones out of dlvr-it again too |
17:37
|
|
Terbium has quit IRC (Quit: Terbium) |
17:37
|
|
Terbium has joined #urlteam |
17:53
|
|
Terbium has quit IRC (Quit: Terbium) |
17:56
|
|
Terbium has joined #urlteam |
18:29
|
tech234a |
Here are some URLs I discovered on URL shorteners using my script. My script seems to struggle with less common URL shorteners. https://drive.google.com/open?id=1F-9LAj_jOs1JV_PTxU91n98ARr-LWEcP |
18:34
|
tech234a |
LMK what else to run it against |
19:04
|
|
morgan_ has joined #urlteam |
19:39
|
|
Zerote has quit IRC (Ping timeout: 260 seconds) |
20:28
|
|
icedice2 has joined #urlteam |
20:28
|
|
icedice has quit IRC (Ping timeout: 252 seconds) |
20:35
|
|
icedice2 has quit IRC (Quit: Leaving) |
20:46
|
JAA |
marked: The short URL redirects would be available in the Wayback Machine. And we could also easily record that a certain shortcode did not exist at a certain time (by keeping the 404 or whatever the service replies with in the WARCs). |
20:51
|
|
odemg has joined #urlteam |
21:18
|
|
Zerote has joined #urlteam |
22:43
|
Somebody2 |
Nice! |
23:34
|
|
tech234a has quit IRC (Quit: Connection closed for inactivity) |